[Corpora-List] ANC, FROWN, Fuzzy Logic

Tue Jul 25 16:30:43 UTC 2006

Rob,

As you know, I have a great deal of sympathy for the idea
of using corpora in various ways in language analysis.

There is also growing evidence that the number of rules
needed to parse a corpus does not converge as the corpus grows.
Like the vocabulary of any language, whose distribution
has a very long tail, the distribution of grammar rules
(or any other kind of language description) also has
a very long tail.
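
To make the long-tail point concrete, here is a minimal sketch in
Python.  The sample sentence and the function names are invented
for illustration and are not taken from any real corpus.  It counts
how often each word type occurs and tracks how the number of
distinct types grows as more tokens are read.  The same kind of
count could be made over grammar rules instead of words.

```python
from collections import Counter

def frequency_profile(tokens):
    """Return type frequencies sorted from most to least frequent,
    plus the number of types that occur exactly once."""
    counts = Counter(tokens)
    ranked = sorted(counts.values(), reverse=True)
    hapaxes = sum(1 for c in ranked if c == 1)
    return ranked, hapaxes

def types_seen_so_far(tokens):
    """For each position, the number of distinct types seen up to it."""
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth

# Toy sample (an assumption for illustration, not real corpus data).
sample = ("the cat sat on the mat and the dog sat on the log "
          "while a bird sang over the quiet garden").split()

ranked, hapaxes = frequency_profile(sample)
growth = types_seen_so_far(sample)

print("frequencies by rank:", ranked)
print("types occurring once:", hapaxes, "of", len(ranked))
print("distinct types after each token:", growth)
```

Even in a sample this small, a few types account for many of the
tokens while most types occur only once, and the count of distinct
types is still rising at the end.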

 > Otherwise put, that experimental observations are the
 > most compact representations for many systems.

But that does not imply that natural language data cannot
be compressed.  The fact that the curve of the number of
rules (or whatever kind of description you prefer) falls
off very rapidly near the beginning means that language data
can be compressed, even though perfect compression may be
impossible.
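
A rough sketch of that point, under stated assumptions: the word
stream below is synthetic, drawn from a made-up 200-word vocabulary
with weights proportional to 1/rank, and zlib stands in for "some
compression algorithm".  It is not a model of grammar.  It only
shows that data with a steeply falling frequency curve can be
compressed losslessly by a general-purpose compressor, while
uniformly random bytes of the same length cannot.

```python
import random
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data, 9)) / len(data)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical vocabulary with a 1/rank (Zipf-like) weighting.
vocab = [f"w{i}" for i in range(200)]
weights = [1.0 / (rank + 1) for rank in range(len(vocab))]

words = rng.choices(vocab, weights=weights, k=5000)
skewed = " ".join(words).encode("ascii")

# Uniformly random bytes of the same length, for comparison.
noise = bytes(rng.randrange(256) for _ in range(len(skewed)))

skewed_ratio = compression_ratio(skewed)
noise_ratio = compression_ratio(noise)

print(f"skewed word stream: {skewed_ratio:.2f}")
print(f"random bytes:       {noise_ratio:.2f}")
```

The compression is imperfect, since the rare words in the tail
still have to be spelled out, but it is far from zero.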

As soon as you admit that corpus data can be compressed,
Chaitin's arguments imply that some algorithm for doing the
compression must exist.  The goal of linguistics is to find
a more humanly readable characterization of that algorithm
than the bit pattern of a computer program.

 > ... people need to accept that for some (Chaitin/Kolmogorov
 > tell us most) systems the experimental facts are the most
 > compact representation.

But there is abundant evidence that NL data can be compressed.
The fact that a two-year-old child can learn any natural language
very rapidly implies that the corpus is highly compressible and
that a relatively small sampling is adequate to make good
predictions about the whole.  The predictions are not 100%
reliable, however, because adults are constantly learning
(and inventing) new words and new grammatical constructions.

I certainly admit that any set of rules (or other concise
characterization of NL data) must be supplemented with data
from a corpus.  I will also admit that for any corpus of
any given size, new data will have to be added from time
to time.  However, the fact that people can successfully
use language, starting in early childhood, implies that
it's possible to start with a corpus that is much, much
smaller than the totality and add more data as needed.

John