Corpora: Chomsky/Harris

Steve Seegmiller seegmillerm at alpha.montclair.edu
Sun Apr 1 18:46:18 UTC 2001


This is a reply to a query from Tony Perretta, which
a colleague forwarded to me. (I am not a subscriber to
this list.)

First a point of clarification: Chomsky has never, to my
knowledge, "discredited" the use of corpora. There is a
bit of a terminological mix-up here, I think, in that
Chomsky did attack the idea that a corpus defines a
language; i.e. that a grammar should be based solely on
the data found in an observed corpus. His point (with
which you cannot disagree, if you look at the relevant
examples) is that no corpus, no matter how large, can
contain every sentence, or even every sentence type,
in the language; and furthermore, that many kinds of
perfectly good sentences (that the grammar should take
into account) have a probability of occurrence in a
given corpus that is indistinguishable from zero. The
conclusion is that a corpus is never enough.

That is quite different from saying that corpora are
not useful sources of data. Anyone who has worked with
a large corpus has found many, many surprises there,
including lexical uses and syntactic constructions that
s/he would not have thought of otherwise.

It is unfortunate that many people in the corpus
linguistics community have put themselves in opposition
to Chomskyan linguists. (At the recent conference on
Corpus Linguistics and Language Teaching in Boston,
several references were made to "the enemy" at MIT.
That is a most unfortunate, and unnecessary, view.)
There is no inherent incompatibility between theoretical
generative linguistics and corpus linguistics, and
by focussing on the enmity, many corpus linguists are
making it impossible to discuss the real issues
involved.

Having said all that, I have only a little information
on Harris's approach to parsing and such things. Harris
developed, in addition to his transformational analysis,
something called string grammar, which was a
non-transformational kind of analysis that encoded certain
transformation-type information. It was much
easier, in the early days of computational linguistics,
to program string grammar than transformational grammar,
so several of Harris's students adopted string grammar
as the basis for parsers, information retrieval systems,
etc. One such project was the String Project at New York
University, directed by Naomi Sager. I believe it is still
in operation. Another implementation was built by Aravind
Joshi. I do not know specifically of any statistical
parsers based on Harrisian transformational grammar,
but parsing is not my field so there could well be some.

Best wishes,


Steve Seegmiller, Ph.D.
Linguistics Department
Montclair State University
Upper Montclair, NJ 07043

seegmillerm at alpha.montclair.edu
http://www.chss.montclair.edu/linguistics/lingpage/faculty/seeg/seeg.htm


