[Corpora-List] Re: ANC, FROWN, Fuzzy Logic
FIDELHOLTZ_DOOCHIN_JAMES_LAWRENCE
jfidel at siu.buap.mx
Wed Jul 26 09:53:04 UTC 2006
Hi, all,
Before I start, I should clarify that I have never worked on compression, so
maybe I'm missing something obvious to those who do work on it. Still, I
can't buy Rob's claim that
> To explain every idiosyncrasy of a given individual's productions it
> seems likely you would need that individual's entire corpus, but for
> understanding you would only need overlap.
Getting a 'complete' corpus for any individual would be theoretically
impossible, since no one produces all of their knowledge about language. Ie,
there *does* exist passive knowledge (as well as implicit knowledge, not the
same), as we know from all those 'silent period' kids (most of them), who *do*
understand long before they start speaking (often not speaking at all until
well into their second year of life); from the well-known fact that our 'passive'
vocabulary is much larger than the vocabulary we use; etc.
On the other hand, it takes rather little exposure to a language to begin to
make noticeable strides in acquiring it, as anyone who has ever learned a
second language 'in situ' as an adult knows very well. You don't ever get it
all, but you sure can get significant 'overlap', as Rob would say. And an
hour or so of a demonstration of this to an impressed MIT freshman by
Kenneth Pike is what made me into a linguist (that and a serendipitous but
super course from Morris Halle a couple of years later).
Earlier, Rob says:
> To say "perfect compression may be impossible" is to concede the point I
> wish to make.
> ...
> I see no evidence NL data can be compressed "completely". On the
> contrary, the evidence indicates to me that any compression of NL data
> must be "incomplete" (and each incomplete compression involves a loss
> of information which can only be prevented by retaining the whole
> corpus anyway.)
Well, for starters, though it's a trivial example, we have a good example of
a perfectly (ie, losslessly) compressible code: (original) ASCII, a seven-bit
code for which we 'waste' a full eight-bit byte per 'letter'. We can pack this
down to 87.5% of its former size in bytes and get 100% of the original back
(a toy sketch of this follows the present paragraph). Now that ain't exactly NL, but
I draw the conclusions that: 1) compression doesn't necessarily have to be
lossy, or at least not too lossy (and remember, eg, that any audio signal you
can make out, however noisy, can also be 'made out' [ie, greatly cleaned up,
though with great loss--especially of the noise] by the computer using cepstra); 2)
good rules (or their equivalent in whatever theory you're partial to) make
all this possible. No linguist, however poor, would deny the importance of
having good generalizations about a particular language, corpus, etc. And no
decent linguist, however good, would (or certainly: should) deny that their
analysis of a particular language, corpus, etc. could be bettered. That's
what science is all about, after all.
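Just to make the ASCII point concrete, here's a little Python sketch (mine, and
the function names are purely illustrative) of packing the 7-bit ASCII codes of
a string into a continuous bit string and then recovering every character intact:

# Toy illustration: lossless 'compression' of ASCII text by packing
# 7-bit character codes tightly, then unpacking them again.
def pack_ascii(text: str) -> bytes:
    """Pack 7-bit ASCII codes into bytes (~7/8 of the original size)."""
    bits = ''.join(format(ord(c), '07b') for c in text)   # 7 bits per character
    bits += '0' * (-len(bits) % 8)                         # pad to a byte boundary
    return bytes(int(bits[i:i+8], 2) for i in range(0, len(bits), 8))

def unpack_ascii(packed: bytes, n_chars: int) -> str:
    """Recover the original text exactly; no information is lost."""
    bits = ''.join(format(b, '08b') for b in packed)
    return ''.join(chr(int(bits[i*7:(i+1)*7], 2)) for i in range(n_chars))

text = "Grimm's Law"
packed = pack_ascii(text)
assert unpack_ascii(packed, len(text)) == text      # 100% correct results back
print(len(packed), 'bytes instead of', len(text))   # ~7/8 of the size, plus a little padding

Nothing deep, of course: the point is only that when the code has known
regularities, the compression can be perfectly reversible.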
I don't think we need 'complete compression' to have perfectly useful
results. After all, even humans make mistakes (!): in understanding, in
production, etc. And even linguists, in assigning POS tags, for example, don't
do much better (in agreement among themselves) than the best empirical
computer programs, which get up to about 99% 'correct'. Of course, I would
maintain that the 1% not covered here would vary wildly between linguists
and computers (the latter making mistakes heavily among the least frequent
words and uses, for example, which would be much less problematical for
humans, in general). This does not indicate to me that computer processing
is impossible, but rather that we just need better, more 'human-like'
algorithms. (Not that it's trivial to discover them, of course.) Now, 99%
correct, on the face of it, sounds great, until you reflect that in a corpus
of, say, 100 megawords (nowadays a smallish or at best medium-sized corpus),
that implies a million words incorrectly classified (and, I would maintain,
precisely some of the linguistically most interesting cases; although, I
must admit, for many practical purposes this probably *would* be great or at
least useful).
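For the arithmetically inclined, the back-of-the-envelope calculation runs like
this (the 99% figure is just the rough estimate above, not a measurement):

# Residual error counts at 99% tagging accuracy for corpora of various sizes.
accuracy = 0.99
for corpus_size in (1_000_000, 100_000_000, 1_000_000_000):   # 1 Mw, 100 Mw, 1 Gw
    errors = round(corpus_size * (1 - accuracy))
    print(f'{corpus_size:>13,} words -> {errors:>10,} incorrectly tagged')
# 100,000,000 words -> about 1,000,000 errors, as noted above.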
One final point about compression of NL data. No practicing linguist can
have failed to notice that *none* of the rules (again: or the
equivalent--from here on, I will just talk about 'rules', with this
parenthesis understood) which they have come up with is without exception. I
can't think of a single rule I have ever come into contact with that doesn't
have some exceptions (eg, linguists marvel over the description of a
language [I forget which one] whose *only* irregular verb is 'to be'--but
there *is* still one). Virtually all languages with any conjugation at all
have at least a few deponent verbs. Etc.
As Halle long ago remarked, however, exceptions may prove (ie, test) the
rule, but they don't invalidate it. They may either indicate the necessity
of a reformulation of the rule (remember Verner's Law, still famous well over
a century on as a 'correction' to Grimm's Law [actually, this should
probably be: Grimms' Law], one of the most famous rules in linguistics), or
they may be *real exceptions*, which all practicing linguists know really
*do* exist. Our aim as analysts of language is to throw out the bathwater
(the detritus of Verner's Law, eg) while keeping the baby (the rule: Grimm's
Law), while still permitting true exceptions (eg, here, some onomatopoetic
words, but also a few 'normal' words). Now, the description in the previous
sentence is actually in the *best* of circumstances (eg, right after the
'completion' of Grimm's Law). Later borrowings, analogic creations, etc. can
further screw up the system, and in some cases (eg, English fricative
voicing) demolish or radically restructure parts of the system. But ya gotta
keep the baby!
In a different vein, socioLINGUISTICS (in the sense where rules spread
geographically and/or socially and/or partially [with respect to features,
eg], along with markedness) has allowed linguists to nuance the possible
implementations of rules. Eg, with respect to the partial geographic spread of
consonant-shift rules (here the High German, rather than Grimm's, shift), this
lets us understand the so-called Rhenish Fan.
At one point, Rob says:
> Knowing this means we can find the abstraction (grammar) relevant to
> any purpose we choose: make a given parsing decision, agree on the
> significance of a word in a given context. Not knowing this means
> we constantly swim around trying to find a single abstraction to fit
> every purpose, and fail (we end up with "fuzzy" categories.)
Well, I guess I have to admit that nearly all linguistic categories are
fuzzy. That is decidedly *not*, however, a research strategy. The *only*
reasonable (ie, scientific, I would say) research strategy is to always
assume that any hypothesized categories are strict (yes or no) and see what
that produces as results. If those results are unacceptable or
contradictory, we should still, IMHO, carry on to the bitter end before
backing up, because some of the further consequences of unacceptable
conclusions may be enlightening in future research. Of course, since the
falsity of (1) implies the falsity of (2) (ie, conclusions built on a false
hypothesis cannot be trusted), we have no permanent results yet, but still they
may be useful in the future. And now you can see why I have never won the
Nobel Prize (aside from the fact that I'm a linguist). To get back to the
point, having discovered cases which apparently may fit in either of the
hypothesized categories, there are still several options before accepting
fuzzy categories, however conceptually appealing these latter may seem to
be. For one, we may have missed a category (eg, if Adjectives sometimes
behave as Verbs and sometimes as Nouns, it may indicate that these latter
two categories are 'fuzzy'; or it may indicate that we need a further
category Adjective; or it may indicate that we need a breakdown into some
sort of Distinctive Features: Verb = [+verb, -noun]; Noun =
[-verb, +noun]; Adjective = [+verb, +noun]. This last possibility, however,
itself automatically produces further corollaries, eg that there should
exist another category [-verb, -noun]. Now in turn, this could be, say,
Adverb; or it could imply a hierarchically superior category [+/-Major
Class], with [+Major Class] subdivided by the [of course, I am assuming binary
features, a whole nother discussion] features [+/-verb] and [+/-noun], and with
[-Major Class] including everything else: so-called function
words, prepositions, markers, interjections {yes, Virginia, this *is* a real
*linguistic* category!}, clitics, etc.) (bet you thought I'd forgotten that
closing parenthesis).
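For concreteness, here is a little Python sketch of that feature decomposition
(the feature names and the layout are mine, purely illustrative); note how the
fourth cell [-verb, -noun] falls out automatically:

# Toy encoding of the binary-feature decomposition sketched above:
# Verb = [+verb, -noun], Noun = [-verb, +noun], Adjective = [+verb, +noun];
# the fourth cell [-verb, -noun] is predicted automatically and could be
# Adverb, or the whole [-Major Class] residue of function words, clitics, etc.
MAJOR_CLASSES = {
    'Verb':      {'verb': True,  'noun': False},
    'Noun':      {'verb': False, 'noun': True},
    'Adjective': {'verb': True,  'noun': True},
    'Adverb':    {'verb': False, 'noun': False},   # the predicted fourth category
}

def natural_class(**wanted):
    """Return all categories matching the given feature values,
    eg natural_class(verb=True) gives the [+verb] classes."""
    return [cat for cat, feats in MAJOR_CLASSES.items()
            if all(feats[f] == v for f, v in wanted.items())]

print(natural_class(verb=True))    # ['Verb', 'Adjective'], the [+verb] classes
print(natural_class(noun=False))   # ['Verb', 'Adverb'], the [-noun] classes

The payoff of such a decomposition, strict categories and all, is exactly that
it makes predictions (the fourth cell, the natural classes) which can then be
tested and, if need be, shot down.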
Anyway, every hypothesis leads to further hypotheses, back to corrections,
on to further hypotheses, etc. Likewise, this is a cooperative enterprise.
That's why we rejoice when our hypotheses get shot down, either by ourselves
(the best case scenario, obviously, and, after all, what we are obligated to try to do)
or by others (thanks, guys). In the latter case, at least we know that
someone is reading our work.
OK. That's my story and I'm sticking to it (that's what happens when you let
Old Dogs into the list!).
Jim
James L. Fidelholtz
Posgrado en Ciencias del Lenguaje, ICSyH
Benemérita Universidad Autónoma de Puebla MÉXICO
Rob Freeman wrote:
...