[Corpora-List] ANC, FROWN, Fuzzy Logic

Wed Jul 26 06:58:43 UTC 2006

On Wednesday 26 July 2006 00:30, John F. Sowa wrote:
> 
>  > Otherwise put, that experimental observations are the
>  > most compact representations for many systems.
>
> But that does not imply that natural language data cannot
> be compressed ... even though perfect compression may be
> impossible.

To say "perfect compression may be impossible" is to concede the point I wish 
to make.

> But there is abundant evidence that NL data can be compressed.

I see no evidence that NL data can be compressed "completely". On the
contrary, the evidence indicates to me that any compression of NL data must
be "incomplete" (and each incomplete compression involves a loss of
information which can only be prevented by retaining the whole corpus anyway).

We've been running around for 50 years or more finding incomplete 
compressions. You would think we'd get the hint.

Reading Chaitin helps us understand that this is normal, and representative
of a much broader "problem" in science (actually a solution, because it
provides a path to much greater representational power than any single set
of rules could provide).

This doesn't mean compression of NL data to find rules or classes is wrong. 
(The news is good. Most of the actual computational machinery we've been 
using to analyze corpora is still useful.) It just means any abstraction of 
NL data into a system of rules or classes must be seen as specific to a 
purpose.

Understanding this is the key to unlocking the power of the corpus.

Knowing this means we can find the abstraction (grammar) relevant to any
purpose we choose, whether that is making a given parsing decision or
agreeing on the significance of a word in a given context. Not knowing this
means we constantly swim around trying to find a single abstraction to fit
every purpose, and fail (we end up with "fuzzy" categories).
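
A toy sketch of the point, in Python. Everything in it is an assumption made
up for illustration: the three-sentence corpus, the helper name classes_by,
and the choice of "words sharing the same left (or right) neighbours" as a
stand-in for a purpose. It only shows that the same data yields different
classes depending on what you ask of it.

```python
from collections import defaultdict

# Hypothetical toy corpus, invented for this illustration.
CORPUS = [
    "the dog runs",
    "the cat sleeps",
    "a bird runs",
]

def classes_by(corpus, offset):
    """Group words that share the same set of neighbours at `offset`.

    offset=-1 groups by left neighbour, offset=+1 by right neighbour.
    Words with no neighbour at that offset are left out.
    """
    contexts = defaultdict(set)
    for sentence in corpus:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            j = i + offset
            if 0 <= j < len(tokens):
                contexts[word].add(tokens[j])
    groups = defaultdict(set)
    for word, ctx in contexts.items():
        groups[frozenset(ctx)].add(word)
    return {frozenset(members) for members in groups.values()}

# Purpose 1: what can follow a determiner? "dog" and "cat" fall together.
left_classes = classes_by(CORPUS, -1)

# Purpose 2: what can precede a verb? Now "dog" and "bird" fall together.
right_classes = classes_by(CORPUS, +1)
```

Neither partition can be recovered from the other; to get both you have to go
back to the sentences themselves. Force the two into one set of classes and
"dog" sits half with "cat" and half with "bird".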

> ...the fact that people can successfully
> use language, starting in early childhood, implies that
> it's possible to start with a corpus that is much, much
> smaller than totality and add more data as needed.

Naturally everybody's individual corpus must be different. To explain every
idiosyncrasy of a given individual's productions it seems likely you would
need that individual's entire corpus, but for understanding you would only
need the corpora to overlap.

-Rob