[Corpora-List] ANC, FROWN, Fuzzy Logic
Mike Maxwell
maxwell at ldc.upenn.edu
Thu Jul 27 03:00:49 UTC 2006
Rob Freeman wrote:
> We've been running around for 50 years or more finding incomplete
> compressions. You would think we'd get the hint.
I don't get the hint, even after you've told me there is a hint :-).
I can certainly believe that human beings internalize a grammar without
also believing that the grammar needs to be "perfect" in any sense.
I can also believe that the grammar does not need to extract every last
bit of entropy out of the language (and I mean _language_, not corpus,
see below).
But let's get down to some actual data and theories. How far the
compression should proceed was precisely the point behind a lot of the
arguments over abstractness (particularly among phonologists; the point
is less clear in syntax). To take an example, in one of his
papers Morris Halle argued (or maybe just assumed) that such
semi-regular verbs in English as 'weep' and 'keep' in fact have a
rule-governed past tense ('wept' and 'kept', etc.). I, on the other
hand, think it's completely possible that native speakers of English do
not extract such a rule (although they do extract the rules for regular
past tense verbs). (Of course it's possible that some native speakers
do, and others do not, extract such a rule.)
Another example along the same lines would be the diphthongizing verbs
in Spanish, like 'venir', whose stem diphthongizes to 'vien' when
stressed. James Harris has argued for a rule-governed approach, which
requires a diacritic. Again, it's perfectly possible that native
speakers of Spanish just memorize the irregular stems, i.e. that their
internalized grammars don't do perfect compression.
In cases like these, linguists can argue--and have argued--for a greater
or lesser degree of compression. And no one ever worried, afaik, about
whether the compression had to be perfect (although admittedly, there
were some pretty abstract analyses in the bad olde days).
(BTW, it's unclear to me--as I think another poster pointed out--whether
compression of a corpus by a grammar is at all relevant. What grammars
do, I would say, is compress the _language_, of which the corpus is but
a small sample. One can test whether the grammar works by measuring how
well it compresses a given corpus of the language, but I don't see the
point of asking whether we perfectly compress some arbitrary corpus.)
--
Mike Maxwell
maxwell at ldc.upenn.edu