[Corpora-List] ANC, FROWN, Fuzzy Logic

Rob Freeman lists at chaoticlanguage.com
Fri Jul 28 07:32:33 UTC 2006


It does not matter that it is possible to gzip text, Mike. Factoring out a few 
repetitions and redundancies doesn't stop text being text. The point is that 
our failure to find a complete grammar for any language means that on some 
level natural language text is Kolmogorov complex: you can't compress text 
below the level of text, and we need to deal with that in order to move 
forward.
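
To make that concrete, here is a rough Python sketch (the file name is just a 
placeholder for any long stretch of English text). Gzip squeezes out local 
repetition easily enough, but a second pass over the residue gains almost 
nothing; factoring out redundancy is not the same as finding a short 
description.

    import gzip

    # Any reasonably long English text file; the name is a placeholder.
    with open("english_sample.txt", "rb") as f:
        text = f.read()

    once = gzip.compress(text)    # strips local repetition and redundancy
    twice = gzip.compress(once)   # try to compress the residue again

    print(len(text), len(once), len(twice))
    # Typically the first pass shrinks the file by half or more, while the
    # second pass gains next to nothing: what is left looks random to gzip.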

If you find a complete grammar (compression), let me know. If you keep getting 
fuzzy, incomplete classes, you may wish to consider this explanation.

As Jim says:

"No practicing linguist can have failed to notice that *none* of the rules 
which they have come up with is without exception. I can't think of a single 
rule I have ever come into contact with that doesn't have some exceptions..."

Now, as Jim points out, the classic generativist position is that this doesn't 
mean the rule is invalid (in the sense that there is some basis for the 
generalization). Sure, but it also doesn't mean that you can keep only the 
rule and not lose information. The two are not the same.

Talk about throwing the baby out with the bath water. If you try to throw out 
the text and just keep the rules, you lose all that information about 
"exceptions" (and other, equally valid, generalizations which can be made 
over those "exceptions").

The evidence that you can't compress text using rules is legion. Jim's message 
is full of it. It is just that he presents it as reasons we should not expect 
too much of our theory (by which he means generativism), rather than as 
evidence that we should change our theory.

It is quite fun to go through Jim's message and pull out all the evidence of 
grammatical incompleteness, presented as a justification for lack of 
theoretical rigor. Jim's definition of "perfectly useful" is that technology 
should fail by about the same amount as humans at tasks which can be shown to 
be ill-defined (e.g. tagging, for which Jim gives a theoretical limit on 
accuracy of 99%, an error in about one sentence in ten, falling on "precisely 
some of the linguistically most interesting cases"; and 99% is modest, given 
that Ken Church published a figure of 97% agreement between humans 14 years 
ago). This is followed by the lament that we should be able to do better, if 
only we could think of a way to make our algorithms more 'human-like' (though 
he, Jim, will not be changing).
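
Just to spell out what a 99% per-word figure means at the sentence level, here 
is the arithmetic in a few lines of Python (independent errors and the 
sentence lengths are my own illustrative assumptions, not Jim's figures):

    # Fraction of sentences containing at least one tagging error, given a
    # per-word accuracy of 99%, independent errors, and n-word sentences.
    for n in (10, 20):
        with_error = 1 - 0.99 ** n
        print(n, round(with_error, 2))
    # Prints roughly 0.10 for n=10 and 0.18 for n=20: an error in about one
    # sentence in ten for short sentences, nearer one in five for longer ones.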

No offense intended, Jim. You present a good summary of the problem as seen 
from the wrong (that is, generativist) point of view. You are not the only Old 
Dog unwilling to change, even when presented with that more 'human-like' 
algorithm. At least you are up front about it.

Anyway, that is Jim's evidence for grammatical 
incompleteness/incompressibility. What was your other point, Mike? That all 
the evidence shows language learners themselves perform "less than maximal" 
compression. If you mean they retain lots of performance data verbatim, I 
agree totally. That is rather my point. What is at issue is whether that 
verbatim text is actually less compressed than rules would make it, or more 
(that is, whether text itself is K-complex).

Why would learners fail to abstract text into rules if it were possible to do 
so without losing information they need?

Nature is always lazy, but never inefficient. The failure of learners to 
abstract text into rules is just more evidence that text itself is actually 
K-complex.

I'd better stop. This message is too long.

-Rob

On Friday 28 July 2006 06:39, Mike Maxwell wrote:
> Rob Freeman wrote:
> > Mike - I'm not sure what you are saying, other than that linguists have
> > been careless about fitting theory to the data.
>
> Let me put it more bluntly.  I'm saying that when you say things like
>
>  > No single grammar of natural language can ever be complete.
>  > This is because natural language text is at some level
>  > Kolmogorov complex.
>
> you're wrong, or arguing against a straw man, or both.  You're wrong if
> you mean that any particular finite natural language text is K-complex,
> because it isn't; the fact that you can zip English files and get
> smaller files is enough to show that (as others have pointed out in this
> discussion).  And John Goldsmith's response told how his work on
> morphology induction was based on a form of compression, which also
> depends on text being non-K-complex.
>
> Besides, most linguists (particularly generative linguists) do not
> consider coverage of a finite corpus (if that's what you mean by
> "natural language text") to be a goal, at least when it comes to syntax.
>   So in this case you're arguing against a straw man.
>
> As for linguists being careless about fitting the theory to the data,
> that has of course happened, but I wasn't talking about that.  I was
> rather saying that the issue of the degree of compression of natural
> language that humans do in the process of language learning might be
> less than maximal.  So if by "careless" you mean "not doing maximal
> compression", then human language learners may well be worse offenders
> than linguists.
>
> I am tempted to say more, but I'll stop there.


