[Corpora-List] ANC, FROWN, Fuzzy Logic
Rob Freeman
lists at chaoticlanguage.com
Wed Jul 26 06:58:43 UTC 2006
On Wednesday 26 July 2006 00:30, John F. Sowa wrote:
>
> > Otherwise put, that experimental observations are the
> > most compact representations for many systems.
>
> But that does not imply that natural language data cannot
> be compressed ... even though perfect compression may be
> impossible.
To say "perfect compression may be impossible" is to concede the point I wish
to make.
> But there is abundant evidence that NL data can be compressed.
I see no evidence NL data can be compressed "completely". On the contrary, the
evidence indicates to me that any compression of NL data must be
"incomplete" (and each incomplete compression involves a loss of information
which can only be prevented by retaining the whole corpus anyway).
We've been running around for 50 years or more finding incomplete
compressions. You would think we'd get the hint.
Reading Chaitin helps us understand this is normal, and representative of a
much broader "problem" in science (actually a solution, because it provides a
path to much greater representational power than any single set of rules
could provide).
This doesn't mean compression of NL data to find rules or classes is wrong.
(The news is good. Most of the actual computational machinery we've been
using to analyze corpora is still useful.) It just means any abstraction of
NL data into a system of rules or classes must be seen as specific to a
purpose.
Understanding this is the key to unlocking the power of the corpus.
Knowing this means we can find the abstraction (grammar) relevant to any
purpose we choose: make a given parsing decision, agree on the significance
of a word in a given context. Not knowing this means we constantly swim
around trying to find a single abstraction to fit every purpose, and fail (we
end up with "fuzzy" categories).
> ...the fact that people can successfully
> use language, starting in early childhood, implies that
> it's possible to start with a corpus that is much, much
> smaller than totality and add more data as needed.
Naturally everybody's individual corpus must be different. To explain every
idiosyncrasy of a given individual's productions it seems likely you would
need that individual's entire corpus, but for understanding you would only
need overlap.
-Rob