[Corpora-List] ANC, FROWN, Fuzzy Logic

Wed Jul 26 16:22:26 UTC 2006

Ken, Rob, Jim, and Mark,

I mentioned the point about compression only to tie the
discussion to Chaitin's work.  If any corpus is truly random,
no compression is possible.  But if any compression is
possible, then there must exist a more compact description
than a complete enumeration of everything in the corpus.
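
As a rough illustration of the point about compression, here is a
minimal sketch in Python.  It rests on assumptions that are not part
of the discussion itself: zlib serves as a practical stand-in for an
ideal compressor (Chaitin's measure is uncomputable, so any real
compressor gives only an upper bound on the shortest description),
pseudo-random bytes stand in for a "truly random" corpus, and a
repeated sentence stands in for highly patterned text.  The name
compression_ratio is purely illustrative.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower = more compressible)."""
    return len(zlib.compress(data, 9)) / len(data)


# Stand-in for a random corpus: pseudo-random bytes from a seeded generator.
# (A truly random source would look the same to zlib.)
rng = random.Random(0)
random_bytes = bytes(rng.getrandbits(8) for _ in range(20000))

# Stand-in for a patterned corpus: one sentence repeated to the same length.
sentence = b"the goal of linguistics is to characterize that method. "
patterned_bytes = (sentence * 400)[:20000]

random_ratio = compression_ratio(random_bytes)
patterned_ratio = compression_ratio(patterned_bytes)

print(f"random:    {random_ratio:.3f}")
print(f"patterned: {patterned_ratio:.3f}")
```

The random input should come out at roughly its original size, while
the patterned input shrinks to a small fraction of it.  Real corpora
fall somewhere in between, which is all the argument needs: some
description more compact than full enumeration exists.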

RF> On the contrary, the evidence indicates to me that any
 > compression of NL data must be "incomplete" (and each incomplete
 > compression involves a loss of information which can only be
 > prevented by retaining the whole corpus anyway.)
 >
 > We've been running around for 50 years or more finding incomplete
 > compressions. You would think we'd get the hint.

I don't know what hint you're suggesting.  That no rule-based
system can ever be complete?  I think that's obvious.  That
an incomplete compression is useless?  I would very strongly
disagree with any such claim.

JLD> No linguist, however poor, would deny the importance of
 > having good generalizations about a particular language, corpus,
 > etc. And no decent linguist, however good, would (or certainly:
 > should) deny that their analysis of a particular language, corpus,
 > etc. could be bettered.

That is the point I was trying to emphasize.  Although I agree with
Rob that having access to corpus data is valuable during language
analysis, it should be possible to do a large part of the analysis
by means of some more compact method.

The goal of linguistics is to characterize that method, but I'll
avoid any claim that the method must be based on logic, rules,
neurons, or statistics.

MPL> For science to work, theories and other models don't have
 > to be things that are "true". They just have to be things that
 > are _useful_ -- and that implies a purpose against which any
 > scientific model must be evaluated. (Bas van Fraassen)

I agree to a large extent, but I would emphasize the distinction
between engineering and pure science.  The question of "truth" --
i.e., a correspondence with some reality that exists independently
of what we may think about it -- is science, but the question of
usefulness is engineering.  Both are important, but we should be
clear about which goals we are pursuing in any particular project.

For example, the evidence seems to show that Chomsky's distinction
between performance and competence was a dead end for science, but
there may still be valid engineering uses for much of the rule-based
technology that was inspired by Chomsky's work.

KL> When I generate, I feel very much as if my use of a particular
 > word may change from one draft of a paper to the next, i.e.,
 > my whole semantic network of associations changes from day to day.

I agree.  I like Alan Cruse's word "microsense" for the subtle
variations.  Below is a famous quotation from Steiner.  But I don't
believe that we need complete corpora.  When we're talking with
someone, we can just ask a question if we're not sure about his or
her meaning.  And in many cases, the speaker isn't sure either
(note St. Augustine's point about time -- he knows what it is
until somebody asks him).

John Sowa
______________________________________________________________________

From Steiner, George (1975) After Babel:  Aspects of Language and
Translation, Oxford University Press, Oxford, third edition 1998.

No two historical epochs, no two social classes, no two localities use 
words and syntax to signify exactly the same things, to send identical 
signals of valuation and inference. Neither do two human beings. Each 
living person draws, deliberately or in immediate habit, on two sources 
of linguistic supply:  the current vulgate corresponding to his level of 
literacy, and a private thesaurus. The latter is inextricably a part of 
his subconscious, of his memories, so far as they may be verbalized, and 
of the singular, irreducibly specific ensemble of his somatic and 
psychological identity. Part of the answer as to whether there can be 
'private language' is that aspects of every language act are unique and 
individual. They form what linguists call an 'idiolect'. Each 
communicatory gesture has a private residue. The 'personal lexicon' in 
every one of us inevitably qualifies the definitions, connotations, 
semantic moves current in public discourse. The concept of a normal or 
standard idiom is a statistically-based fiction (though it may, as we 
shall see, have real existence in machine translation). The language of 
a community, however uniform its social contour, is an inexhaustibly 
multiple aggregate of speech-atoms, of finally irreducible personal 
meanings.... Thus a human being performs an act of translation, in the 
full sense of the word, when receiving a speech-message from any other 
human being. (pp. 47-48)