[Corpora-List] ANC, FROWN, Fuzzy Logic

Mike Maxwell maxwell at ldc.upenn.edu
Thu Jul 27 03:00:49 UTC 2006


Rob Freeman wrote:
> We've been running around for 50 years or more finding incomplete 
> compressions. You would think we'd get the hint.

I don't get the hint, even after you've told me there is a hint :-).

I can certainly believe that human beings internalize a grammar without
also believing that the grammar needs to be "perfect" in any sense.

I can also believe that the grammar does not need to extract every last
bit of entropy out of the language (and I mean _language_, not corpus,
see below).

But let's get down to some actual data, and theories.  How far the
compression should proceed was precisely the point behind a lot of the
arguments over abstractness (particularly among phonologists; the point
is less clear in syntax).  To take an example, in one of his papers
Morris Halle argued (or maybe just assumed) that semi-regular English
verbs such as 'weep' and 'keep' in fact have a rule-governed past tense
('wept', 'kept', etc.).  I, on the other 
hand, think it's completely possible that native speakers of English do 
not extract such a rule (although they do extract the rules for regular 
past tense verbs).  (Of course it's possible that some native speakers 
do, and others do not, extract such a rule.)

Another example along the same lines would be the diphthongizing verbs
in Spanish, like 'venir', whose stem diphthongizes to 'vien' when
stressed.  James Harris has argued for a rule-governed approach, which
requires a diacritic.  Again, it's perfectly possible that native
speakers of Spanish just memorize the irregular stems, i.e. that their 
internalized grammars don't do perfect compression.

In cases like these, linguists can argue--and have argued--for a greater
or lesser degree of compression.  And no one ever worried, afaik, about 
whether the compression had to be perfect (although admittedly, there 
were some pretty abstract analyses in the bad olde days).
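
A toy sketch of what "a greater or lesser degree of compression" could
mean for the English examples above.  Everything in it is an assumption
made for illustration: the seven-verb sample, the spelling-based rules,
and the crude measure (characters of past-tense forms that no rule
predicts and that would therefore have to be memorized).  It ignores the
cost of stating the rules themselves, and it is not a claim about what
speakers actually do.

```python
# Toy lexicon of stem -> past tense pairs (hypothetical sample, not corpus data).
VERBS = {
    "walk": "walked", "talk": "talked", "jump": "jumped", "play": "played",
    "weep": "wept", "keep": "kept", "sleep": "slept",
}

def regular_past(stem):
    """Regular rule: add -ed (an orthographic simplification)."""
    return stem + "ed"

def eep_past(stem):
    """Semi-regular rule: -eep -> -ept; returns None when it does not apply."""
    return stem[:-3] + "ept" if stem.endswith("eep") else None

def stored_chars(verbs, rules):
    """Characters of past-tense forms that no rule predicts (must be memorized)."""
    return sum(
        len(past)
        for stem, past in verbs.items()
        if not any(rule(stem) == past for rule in rules)
    )

memorize_all = stored_chars(VERBS, [])                       # no rules at all
regular_only = stored_chars(VERBS, [regular_past])           # 'wept' etc. memorized
both_rules = stored_chars(VERBS, [regular_past, eep_past])   # 'wept' etc. by rule

print(memorize_all, regular_only, both_rules)
```

On this sample the three grammars leave 37, 13 and 0 characters to be
memorized.  The middle option corresponds to a speaker who extracts the
regular rule but simply stores 'wept' and 'kept'.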

(BTW, it's unclear to me--as I think another poster pointed out--whether
compression of a corpus by a grammar is at all relevant.  What grammars
do, I would say, is compress the _language_, of which the corpus is but
a small sample.  One can test whether the grammar works by measuring how
well it compresses a given corpus of the language, but I don't see the
point of asking whether we perfectly compress some arbitrary corpus.)

-- 
	Mike Maxwell
	maxwell at ldc.upenn.edu


