[Corpora-List] Re: ANC, FROWN, Fuzzy Logic

FIDELHOLTZ_DOOCHIN_JAMES_LAWRENCE jfidel at siu.buap.mx
Wed Jul 26 09:53:04 UTC 2006


Hi, all, 

Before I start, I should clarify that I have never worked on compression, so 
maybe I'm missing something obvious to those who do work on it. Still, I 
can't buy Rob's claim that 

> To explain every idiosyncrasy of a given individual's productions it
> seems likely you would need that individual's entire corpus, but for 
> understanding you would only need overlap.

Getting a 'complete' corpus for any individual would be theoretically 
impossible, since no one produces all of their knowledge about language. Ie, 
there *does* exist passive knowledge (as well as implicit knowledge, not the 
same thing), as we know from all those 'silent period' kids (most of them), 
who *do* understand before they start speaking (often not speaking at all 
until well into their second year of life); from the well-known fact that our 
'passive' vocabulary is much larger than the vocabulary we actually use; etc. 

On the other hand, it takes rather little exposure to a language to begin to 
make noticeable strides in acquiring it, as anyone who has ever learned a 
second language 'in situ' as an adult knows very well. You don't ever get it 
all, but you sure can get significant 'overlap', as Rob would say. And an 
hour or so of just such a demonstration by Kenneth Pike to an impressed MIT 
freshman is what made me into a linguist (that and a serendipitous but 
super course from Morris Halle a couple of years later). 

Earlier, Rob says: 

> To say "perfect compression may be impossible" is to concede the point I
> wish to make.
> ...
> I see no evidence NL data can be compressed "completely". On the 
> contrary, the evidence indicates to me that any compression of NL data
> must be "incomplete" (and each incomplete compression involves a loss
> of information which can only be prevented by retaining the whole
> corpus anyway.)

Well, for starters, though it's a trivial example, we have a good example of 
a perfectly compressible code: (original) ASCII, a seven-bit code that we 
'wastefully' store in 8-bit bytes. We can losslessly pack this down to 87.5% 
of its former size in bytes, and get 100% correct results back. Now that 
ain't exactly NL, but I draw the conclusions that: 1) compression doesn't 
necessarily have to be lossy, or at least not too bad (and remember eg that any audio signal you 
can make out, however noisy, can be 'made out' [ie, greatly cleaned up, 
though with great loss--esp. of noise] by the computer using cepstra); 2) 
good rules (or their equivalent in whatever theory you're partial to) make 
all this possible. No linguist, however poor, would deny the importance of 
having good generalizations about a particular language, corpus, etc. And no 
decent linguist, however good, would (or certainly: should) deny that their 
analysis of a particular language, corpus, etc. could be bettered. That's 
what science is all about, after all. 
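
(A purely illustrative aside of my own, not anything from Rob's post: below 
is a minimal sketch, in Python, of the kind of lossless repacking I mean, 
assuming plain 7-bit ASCII input. The function names are invented for the 
example; the point is only that the round trip is 100% exact while the byte 
count drops to roughly 7/8.) 

# Pack 7-bit ASCII characters into bytes (7 payload bits per character
# instead of 8), then unpack and check that the round trip is exact.

def pack7(text):
    bits, nbits, out = 0, 0, bytearray()
    for ch in text:
        code = ord(ch)
        assert code < 128, "7-bit ASCII only"
        bits = (bits << 7) | code          # append 7 bits
        nbits += 7
        while nbits >= 8:                  # emit every full byte
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:                              # flush the tail, zero-padded
        out.append((bits << (8 - nbits)) & 0xFF)
    return bytes(out)

def unpack7(data, length):
    bits, nbits, chars = 0, 0, []
    for byte in data:
        bits = (bits << 8) | byte
        nbits += 8
        while nbits >= 7 and len(chars) < length:
            nbits -= 7
            chars.append(chr((bits >> nbits) & 0x7F))
    return "".join(chars)

msg = "Grimm's Law is still a good generalization."
packed = pack7(msg)
assert unpack7(packed, len(msg)) == msg    # 100% correct results back
print(len(packed) / len(msg))              # about 7/8 the size, plus padding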

I don't think we need 'complete compression' to have perfectly useful 
results. After all, even humans make mistakes (!): in understanding, in 
production, etc. And even linguists, when assigning POS tags, for example, 
don't agree among themselves much better than the best empirical computer 
programs, which get up to about 99% 'correct'. Of course, I would 
maintain that the 1% not covered here would vary wildly between linguists 
and computers (the latter making mistakes heavily among the least frequent 
words and uses, for example, which would be much less problematical for 
humans, in general). This does not indicate to me that computer processing 
is impossible, but rather that we just need better, more 'human-like' 
algorithms. (Not that it's trivial to discover them, of course.) Now, 99% 
correct, on the face of it, sounds great, until you reflect that in a corpus 
of 100 megawords, say, (nowadays, a smallish or at best medium-sized corpus) 
that implies a million words incorrectly classified (and, I would maintain, 
precisely some of the linguistically most interesting cases; although, I 
must admit, for many practical purposes this probably *would* be great or at 
least useful). 

One final point about compression of NL data. No practicing linguist can 
have failed to notice that *none* of the rules (again: or the 
equivalent--from here on, I will just talk about 'rules', with this 
parenthesis understood) which they have come up with is without exception. I 
can't think of a single rule I have ever come into contact with that doesn't 
have some exceptions (eg, linguists marvel over the description of a 
language [I forget which one] whose *only* irregular verb is 'to be'--but 
there *is* still one). Virtually all languages with any conjugation at all 
have at least a few deponent verbs. Etc. 

As Halle long ago remarked, however, exceptions may prove (ie, test) the 
rule, but they don't invalidate it. They may either indicate the necessity 
of a reformulation of the rule (remember Verner's Law, still famous some 
130 years on as a 'correction' to Grimm's Law [actually, this should 
probably be: Grimms' Law], one of the most famous rules in linguistics), or 
they may be *real exceptions*, which all practicing linguists know really 
*do* exist. Our aim as analysts of language is to throw out the bathwater 
(the detritus of Verner's Law, eg) while keeping the baby (the rule: Grimm's 
Law), while still permitting true exceptions (eg, here, some onomatopoetic 
words, but also a few 'normal' words). Now, the description in the previous 
sentence actually applies only in the *best* of circumstances (eg, right 
after the 'completion' of Grimm's Law). Later borrowings, analogic creations, etc. can 
further screw up the system, and in some cases (eg, English fricative 
voicing) demolish or radically restructure parts of the system. But ya gotta 
keep the baby! 

In a different vein, socioLINGUISTICS (in the sense where rules spread 
geographically and/or socially and/or partially [with respect to features, 
eg]; along with markedness) has allowed linguists to nuance the possible 
implementations of rules. Eg, with respect to the partial implementation of 
Grimm's Law, this permits us to understand the so-called Rhenish Fan. 

At one point, Rob says: 

> Knowing this means we can find the abstraction (grammar) relevant to
> any purpose we choose: make a given parsing decision, agree on the
> significance of a word in a given context. Not knowing this means
> we constantly swim around trying to find a single abstraction to fit
> every purpose, and fail (we end up with "fuzzy" categories.)

Well, I guess I have to admit that nearly all linguistic categories are 
fuzzy. That is decidedly *not*, however, a research strategy. The *only* 
reasonable (ie, scientific, I would say) research strategy is to always 
assume that any hypothesized categories are strict (yes or no) and see what 
that produces as results. If those results are unacceptable or 
contradictory, we should still, IMHO, carry on to the bitter end before 
backing up, because some of the further consequences of unacceptable 
conclusions may be enlightening in future research. Of course, since
a false hypothesis (1) only yields further possibly false conclusions (2), we 
have no permanent results yet, but they may still be useful in the future. 
And now you can see why I have never won the 
Nobel Prize (aside from the fact that I'm a linguist). To get back to the 
point, having discovered cases which apparently may fit in either of the 
hypothesized categories, there are still several options before accepting 
fuzzy categories, however conceptually appealing these latter may seem to 
be. For one, we may have missed a category (eg, if Adjectives sometimes 
behave as Verbs and sometimes as Nouns, it may indicate that these latter 
two categories are 'fuzzy'; or it may indicate that we need a further 
category Adjective; or it may indicate that we need a breakdown into some 
sort of Distinctive Features: Verb = [+verb, -noun]; Noun =
[-verb, +noun]; Adjective = [+verb, +noun]. This last possibility, however, 
itself automatically produces further corollaries, eg that there should 
exist another category [-verb, -noun]. Now in turn, this could be, say, 
Adverb; or it could imply a hierarchically superior category [+/-Major 
Class]: [+Major Class] would be subdivided by the [of course, I am assuming 
binary features, a whole nother discussion] features [+/-verb] and [+/-noun], 
while [-Major Class] would include everything else: so-called function 
words, prepositions, markers, interjections {yes, Virginia, this *is* a real 
*linguistic* category!}, clitics, etc.)(bet you thought I'd forgotten that 
closing parenthesis). 
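
(Again purely my own illustration, nothing from Rob: the feature breakdown 
just sketched is easy to state as a small table in Python, and the fourth, 
predicted combination falls out automatically. The labels are of course up 
for debate.) 

# Toy sketch of the binary-feature breakdown described above: each
# major-class category is a bundle of two binary features, and the
# combinatorics predict a fourth slot for free.

FEATURES = ("verb", "noun")

CATEGORIES = {
    (True,  False): "Verb",        # [+verb, -noun]
    (False, True):  "Noun",        # [-verb, +noun]
    (True,  True):  "Adjective",   # [+verb, +noun]
    (False, False): "Adverb (?)",  # [-verb, -noun]: the predicted fourth slot
}

def describe(bundle):
    signs = ", ".join(("+" if value else "-") + name
                      for value, name in zip(bundle, FEATURES))
    return "[" + signs + "] -> " + CATEGORIES[bundle]

for bundle in CATEGORIES:
    print(describe(bundle))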

Anyway, every hypothesis leads to further hypotheses, back to corrections, 
on to further hypotheses, etc. Likewise, this is a cooperative enterprise. 
That's why we rejoice when our hypotheses get shot down, either by ourselves 
(the best-case scenario, obviously, and, after all, what we are obliged to try to do) 
or by others (thanks, guys). In the latter case, at least we know that 
someone is reading our work. 

OK. That's my story and I'm sticking to it (that's what happens when you let 
Old Dogs into the list!). 

Jim 

James L. Fidelholtz
Posgrado en Ciencias del Lenguaje, ICSyH
Benemérita Universidad Autónoma de Puebla     MÉXICO 


Rob Freeman wrote: 

...


