Corpora: Chomsky and corpus linguistics

Mike Maxwell Mike_Maxwell at sil.org
Sun Apr 8 21:14:46 UTC 2001


Tony McEnery wrote:
>...let me try my best to have a little snip at Samson's locks. 

Since I'm the one who was called Samson in Christopher Bader's msg (which would surely be a shock to anyone I went to grade school with--I was a wimp), I guess it falls on me to wake up before Delilah gets any further.  (Actually, I am in need of a haircut before I go to a job interview at a natural language engineering place in a couple weeks, but I think I'll pay for that, thank you!)

>The usable language technology that I know of now
>owes the greatest debt to corpus based approaches
>to the study of language.

That may be true, but recall that before 1903, aviation relied on hot air balloons.  There's a lot of hot air here, er, I digress...

All seriousness aside, I think I'll bring in a little history lesson for you young whippersnappers.  Back in the 1980s, many people in the NLP world were convinced that the future lay in what were called "semantic grammars."  The idea was that instead of writing a linguistically-based grammar for English (or in rare cases, some other language), then translating the structures your parser produced into some kind of semantic representation, you would bypass the linguistic grammar and go straight for a semantic description.  That is, you wrote rules like
    S --> RUNNER RUNNING-ACTION (MOVEMENT-ADVERB)
to parse something like "John runs fast."
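
For anyone who has not seen one, here is a minimal sketch of how such a rule might be run, in Python. The lexicon, the function name parse_s and the output frame are all made up for illustration; this is not a reconstruction of any particular 1980s system.

```python
# Hypothetical lexicon: each word maps straight to a semantic category,
# with no syntactic categories (noun, verb, adverb) in between.
LEXICON = {
    "john": "RUNNER",
    "mary": "RUNNER",
    "runs": "RUNNING-ACTION",
    "jogs": "RUNNING-ACTION",
    "fast": "MOVEMENT-ADVERB",
    "slowly": "MOVEMENT-ADVERB",
}


def parse_s(sentence):
    """Apply S --> RUNNER RUNNING-ACTION (MOVEMENT-ADVERB).

    Returns a small semantic frame (a dict) on success, or None if the
    sentence does not match the rule.
    """
    words = sentence.lower().rstrip(".").split()
    cats = [LEXICON.get(w) for w in words]

    # The first two categories are obligatory.
    if cats[:2] != ["RUNNER", "RUNNING-ACTION"]:
        return None

    frame = {"runner": words[0], "action": words[1]}
    rest = cats[2:]
    if rest == []:
        return frame                      # optional adverb absent
    if rest == ["MOVEMENT-ADVERB"]:
        frame["manner"] = words[2]        # optional adverb present
        return frame
    return None                           # anything else is outside the rule


print(parse_s("John runs fast."))
print(parse_s("The engine runs fast."))
```

The first call returns a frame; the second returns None, because "engine" is not a RUNNER in this toy lexicon. Every category is tied to one domain, so covering a new domain means writing a new lexicon and new rules.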

Wild and wooly semantic-grammar based applications were demonstrated which worked fine in demos, but which tended to fall apart in real life (not to mention the fact that absolutely nothing in any of these grammars was portable to a new domain).  But at some point in time, I suppose someone could have claimed that (most of) "the usable language technology" was based on semantic grammars.  Whether that was true or not is not something I would want to argue, nor do I want to argue whether current "usable language technology" is based mostly on corpus linguistics.  What we really want to know is not what has worked thus far, but what is the way forward?  This Samson isn't enough of a prophet to know that, nor is anyone else out there, I daresay--each of us is just climbing what looks to be the largest hill around, and the tops are all covered in clouds.

>Smith continues to argue that scientists work on 
>idealised examples and that people using 'common 
>sense' misunderstand the true goals of science.  In 
>characterising linguistics in this way, Smith arguably 
>casts corpus linguists as non-linguists and non-scientific...

I haven't read Smith's book, so I shouldn't say anything.  But the real-life Samson was nothing if not frisky, so let me tie a few torches to foxes' tails and let them loose in the Philistines' field.  (If you don't understand that allusion, you need to read your Bible!)

Before Newton, there was a theory of movement, originally devised by Aristotle.  This theory was comprehensive, covering not only ballistic movement, but virtually every kind of movement, including the growth of plants.  (At least that's what I recall--and no, I wasn't there.  I'm not _that_ old.)  Newton's theory was a step backward, from a certain point of view; it only covered certain kinds of movement, and didn't try to explain a great many things that Aristotle's theory did cover.  Worse, it made certain idealizations, such as lack of friction--idealizations that were patently false in the real world.  And real world facts, like friction, are things that engineers can't afford to ignore.  (That's not to say there isn't a science of friction, but that's another story, and another century.)

Bringing this back to the present thread: whether corpus linguists are linguists, I won't attempt to say.  But I will say that they tend to be much more engineers than scientists.  Chomsky, OTOH, is a scientist.  Sometimes the scientists produce things the engineers can make use of; sometimes they don't.  (There aren't many practical applications of neutron stars, and probably won't be for a long time.)  So if Smith seems to cast "corpus linguists as non-linguists and non-scientific", then I am at least halfway in agreement.  But then, I'd rather drive a car designed by an engineer than one designed by a scientist who ignored friction!

>...I think it is quite relevant to point out how in 
>the presentation of his ideas the work and worth
>of corpus linguistics is often grossly misrepresented 
>by those linguists who work in the tradition Chomsky 
>has established.

But the original argument of this thread was whether the work and worth of _Chomsky's_ ideas were grossly misrepresented by those linguists who work in the tradition of corpus linguistics.  (Which is not to say that the other point is untrue.)

>I would also claim that the focus on corpus data by 
>some linguists has also led to more practical 
>applications of linguistics than work conducted 
>in the Chomskyan paradigm ever will.

It might be true that corpus data has led to lots of practical applications, but the "ever will" is harder to say.  Always shifting, the future is.  From my personal point of view, I think a hybrid system (theory-based grammar, with probabilities of various interpretations determined by corpus-based studies) has a certain appeal, at least for the near future.  But I could be wrong.
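
A rough sketch of what such a hybrid could look like, in Python, using prepositional-phrase attachment as the ambiguity. It assumes the grammar has already produced the candidate interpretations, so the corpus only has to rank them. The toy corpus, the function names and the add-one smoothing are all assumptions chosen for illustration; the counts are invented and not taken from any real treebank.

```python
from collections import Counter

# Toy "annotated corpus" of (verb, preposition, attachment) decisions.
# Invented for illustration only; a real system would extract these
# from a hand-annotated corpus.
TOY_CORPUS = [
    ("saw", "with", "verb"),
    ("saw", "with", "verb"),
    ("saw", "with", "noun"),
    ("ate", "with", "verb"),
    ("bought", "of", "noun"),
]

COUNTS = Counter(TOY_CORPUS)


def attachment_probs(verb, prep, candidates):
    """Estimate a probability for each interpretation the grammar licenses.

    Uses relative frequency with add-one smoothing, so that an
    interpretation never seen in the corpus still gets some probability.
    """
    smoothed = {c: COUNTS[(verb, prep, c)] + 1 for c in candidates}
    total = sum(smoothed.values())
    return {c: n / total for c, n in smoothed.items()}


def best_attachment(verb, prep, candidates):
    """Pick the most probable of the grammar's candidate interpretations."""
    probs = attachment_probs(verb, prep, candidates)
    return max(probs, key=probs.get)


# "saw the man with the telescope": suppose the grammar allows the PP to
# attach to the verb or to the noun; the corpus counts break the tie.
print(attachment_probs("saw", "with", ["verb", "noun"]))
print(best_attachment("saw", "with", ["verb", "noun"]))
```

The division of labor is the point: the grammar decides which interpretations are possible at all, and the corpus only decides which of those is most likely.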

>I guess the Samsons are now going to tell me 
>all of the practical applications of the minimalist 
>paradigm that there are!

Well, this Samson won't.  Because the stuff I've done has been based on older work by Chomsky and other generativists.  It does seem that in recent years (or decades), the MIT school has shifted more attention to explanatory adequacy (and with Minimalism, adequacy of a different sort--I'd say "optimality adequacy", except that "optimality" has been pre-empted by another school).  And as they shift, I see less and less that would help me make an NLP application.  There are, however, other generative schools who have maintained more of a focus on observational and descriptive adequacy (Stanford, for example), and I continue to find their work helpful.

Time to see if that lion's carcass has any honey in it...

      Mike Maxwell
      Summer Institute of Linguistics
      Mike_Maxwell at sil.org


