[Corpora-List] ANC, FROWN, Fuzzy Logic
Rob Freeman
lists at chaoticlanguage.com
Fri Jul 28 07:32:33 UTC 2006
It does not matter that it is possible to gzip text, Mike. Factoring out a few
repetitions and redundancies doesn't stop text being text. The point is that
our failure to find a complete grammar for any language means that on some
level natural language text is Kolmogorov complex: you can't compress text
below the level of text, and we need to deal with that to be able to move
forward.
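
For concreteness, here is roughly what the gzip observation does and does not
show (a small Python sketch of my own; "sample.txt" is just a placeholder for
any plain English text file):

    # Sketch: gzip factors out local repetition and redundancy in English text,
    # but the output is still far from empty, so ordinary compressibility says
    # nothing about whether a complete, information-preserving grammar exists.
    # "sample.txt" is a placeholder path, not a real corpus.
    import gzip

    with open("sample.txt", "rb") as f:
        raw = f.read()

    compressed = gzip.compress(raw)

    print(f"raw:     {len(raw)} bytes")
    print(f"gzipped: {len(compressed)} bytes")
    print(f"ratio:   {len(compressed) / len(raw):.0%}")

Typical running prose gzips to something like a third of its original size,
not to anything approaching zero.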
If you find a complete grammar (compression), let me know. If you keep getting
fuzzy, incomplete classes, you may wish to consider this explanation.
As Jim says:
"No practicing linguist can have failed to notice that *none* of the rules
which they have come up with is without exception. I can't think of a single
rule I have ever come into contact with that doesn't have some exceptions..."
Now, as Jim points out, the classic generativist position is that this doesn't
mean the rule is invalid (in the sense that there is some basis for the
generalization). Sure, but it also doesn't mean that you can keep only the
rule and not lose information. The two are not the same.
Talk about throwing the baby out with the bath water. If you try to throw out
the text and just keep the rules, you lose all that information about
"exceptions" (and other, equally valid, generalizations which can be made
over those "exceptions").
The evidence that you can't compress text using rules is legion. Jim's message
is full of it. It is just he presents it as reasons we should not expect too
much of our theory (by which he means generativism), rather than as evidence
we should change our theory.
It is quite fun to go through Jim's message and pull out all the evidence of
grammatical incompleteness, presented as a justification for lack of
theoretical rigor. Jim's definition of "perfectly useful" is that technology
should fail by about the same amount as humans do at tasks which can be shown
to be ill-defined (e.g. tagging, for which Jim gives a theoretical limit on
accuracy of 99%, an error in about one sentence in ten, and "precisely some of
the linguistically most interesting cases"; see the rough arithmetic below.
99% is a modest limit, too: Ken Church published a figure of 97% agreement
between humans 14 years ago.) This is followed by the lament that we should be
able to do better, if only we could think of a way to make our algorithms more
'human-like' (but that he, Jim, will not be changing).
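
(Back-of-the-envelope on that "one sentence in ten", with a ten-word average
sentence length assumed purely for illustration, since Jim's message doesn't
give one:)

    # Rough arithmetic behind "99% accuracy = an error in about one sentence in ten".
    # The ten-words-per-sentence average is an assumption, for illustration only.
    per_word_accuracy = 0.99
    words_per_sentence = 10

    p_sentence_all_correct = per_word_accuracy ** words_per_sentence
    print(f"{p_sentence_all_correct:.3f}")  # ~0.904, i.e. roughly 1 sentence in 10 has an error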
No offense intended, Jim. You present a good summary of the problem as seen
from the wrong (that is, generativist) point of view. You are not the only Old
Dog unwilling to change, even when presented with that more 'human-like'
algorithm. At least you are up front about it.
Anyway, that is Jim's evidence for grammatical
incompleteness/incompressibility. What was your other point, Mike? That all
the evidence shows language learners themselves perform "less than maximal"
compression. If you mean they retain lots of performance data verbatim, I
agree totally. That is rather my point. What is at issue is whether that
verbatim text is actually less compressed than rules or more (that is,
whether text itself is K-complex).
Why would learners fail to abstract text into rules if it were possible to do
so without losing information they need?
Nature is always lazy, but never inefficient. The failure of learners to
abstract text into rules is just more evidence that text itself is actually
K-complex.
I'd better stop. This message is too long.
-Rob
On Friday 28 July 2006 06:39, Mike Maxwell wrote:
> Rob Freeman wrote:
> > Mike - I'm not sure what you are saying, other than that linguists have
> > been careless about fitting theory to the data.
>
> Let me put it more bluntly. I'm saying that when you say things like
>
> > No single grammar of natural language can ever be complete.
> > This is because natural language text is at some level
> > Kolmogorov complex.
>
> you're wrong, or arguing against a straw man, or both. You're wrong if
> you mean that any particular finite natural language text is K-complex,
> because it isn't; the fact that you can zip English files and get
> smaller files is enough to show that (as others have pointed out in this
> discussion). And John Goldsmith's response told how his work on
> morphology induction was based on a form of compression, which also
> depends on text being non-K-complex.
>
> Besides, most linguists (particularly generative linguists) do not
> consider coverage of a finite corpus (if that's what you mean by
> "natural language text") to be a goal, at least when it comes to syntax.
> So in this case you're arguing against a straw man.
>
> As for linguists being careless about fitting the theory to the data,
> that has of course happened, but I wasn't talking about that. I was
> rather saying that the issue of the degree of compression of natural
> language that humans do in the process of language learning might be
> less than maximal. So if by "careless" you mean "not doing maximal
> compression", then human language learners may well be worse offenders
> than linguists.
>
> I am tempted to say more, but I'll stop there.