Corpora: Chomsky/Harris

Sun Apr 1 22:29:43 UTC 2001

Hello Steve,

I thought perhaps I should mention on list what I told Tony privately
the other day.

When someone mentions parsing and transformations, Brill's
transformation-based error-driven learning method is what comes to my
mind almost immediately.

If you take a look at his dissertation, he cites Harris quite a bit.  It
would appear that transformation-based error-driven learning is based
somewhat on Harris's ideas.  You can find Brill's dissertation on his
old homepage (#11):

	http://www.cs.jhu.edu/~brill/acadpubs.html

Transformation-based error-driven learning is a robust machine learning
paradigm that has been used successfully for a number of NLP tasks, such
as part-of-speech tagging, parsing, prepositional phrase attachment,
subordinate conjunction attachment, grammatical relation extraction and
word segmentation, so I told Tony that it is probably what he is looking
for.  Algorithms based on transformation-based error-driven learning can
perform as well as or better than Hidden Markov Models.
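
For anyone who has not seen the method, here is a rough sketch of the
general idea as applied to part-of-speech tagging: start from a naive
initial tagging, compare it with a hand-tagged "truth", and greedily
learn an ordered list of rewrite rules, each chosen because it removes
the most remaining errors.  This is only a toy illustration, not Brill's
implementation.  The function names, the tiny lexicon, and the single
rule template ("change tag A to B when the previous tag is C") are all
assumptions made for the example; the real learner uses a richer set of
templates and much more data.

```python
def initial_tags(words, lexicon, default="NN"):
    """Initial-state annotator: give each word its lexicon tag (toy lexicon)."""
    return [lexicon.get(w, default) for w in words]


def apply_rule(tags, rule):
    """Apply one rule: change from_tag to to_tag when the previous tag is prev_tag."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        # Conditions are checked against the tags as they were before this rule ran.
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out


def count_errors(tags, gold):
    """Number of positions where the current tagging disagrees with the truth."""
    return sum(t != g for t, g in zip(tags, gold))


def learn_rules(words, gold, lexicon, max_rules=10):
    """Greedily learn an ordered rule list; each rule must reduce the error count."""
    tags = initial_tags(words, lexicon)
    tagset = sorted(set(gold) | set(tags))
    rules = []
    for _ in range(max_rules):
        best_rule, best_err = None, count_errors(tags, gold)
        for from_tag in tagset:
            for to_tag in tagset:
                if to_tag == from_tag:
                    continue
                for prev_tag in tagset:
                    rule = (from_tag, to_tag, prev_tag)
                    err = count_errors(apply_rule(tags, rule), gold)
                    if err < best_err:
                        best_rule, best_err = rule, err
        if best_rule is None:  # no candidate rule improves the tagging; stop
            break
        tags = apply_rule(tags, best_rule)
        rules.append(best_rule)
    return rules, tags


# Toy data (hypothetical): "can" is tagged MD by the lexicon, which is
# wrong after a determiner and right after a pronoun.
words = "the can rusted . I can run .".split()
gold = ["DT", "NN", "VBD", ".", "PRP", "MD", "VB", "."]
lexicon = {"the": "DT", "can": "MD", "rusted": "VBD",
           ".": ".", "I": "PRP", "run": "VB"}

rules, tags = learn_rules(words, gold, lexicon)
print(rules)
print(tags)
```

On this toy input the learner should settle on a single rule that
retags "can" as a noun after a determiner while leaving the modal use
alone.  The learned rules are applied in the order they were learned
when tagging new text.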

I do agree that there is something of a disconnect between theoretical
linguistics and corpus linguistics, but I also think that this distance
is being narrowed as each camp begins to realize that the other
has useful methods to offer.

As a person with two degrees in Linguistics (B.A. & M.A.) and almost 10
years of full-time computer programming experience, I am fortunate to
feel comfortable in either camp.

-- Mary D. Taffet
   Syracuse University
   Ph.D. Student/School of Information Studies
   Research Analyst/Center for Natural Language Processing
   4-230 Center for Science & Technology
   Syracuse, NY  13244-4100
   mdtaffet at syr.edu

Steve Seegmiller wrote:
>
> This is a reply to a query from Tony Perretta, which
> a colleague forwarded to me. (I am not a subscriber to
> this list.)
>
> First a point of clarification: Chomsky has never, to my
> knowledge, "discredited" the use of corpora. There is a
> bit of a terminological mix-up here, I think, in that
> Chomsky did attack the idea that a corpus defines a
> language; i.e. that a grammar should be based solely on
> the data found in an observed corpus. His point (with
> which you cannot disagree, if you look at the relevant
> examples) is that no corpus, no matter how large, can
> contain every sentence, or even every sentence type,
> in the language; and furthermore, that many kinds of
> perfectly good sentences (that the grammar should take
> into account) have a probability of occurrence in a
> given corpus that is indistinguishable from zero. The
> conclusion is that a corpus is never enough.
>
> That is quite different from saying that corpora are
> not useful sources of data. Anyone who has worked with
> a large corpus has found many, many surprises there,
> including lexical uses and syntactic constructions that
> s/he would not have thought of otherwise.
>
> It is unfortunate that many people in the corpus
> linguistics community have put themselves in opposition
> to Chomskyan linguists. (At the recent conference on
> Corpus Linguistics and Language Teaching in Boston,
> several references were made to "the enemy" at MIT.
> That is a most unfortunate, and unnecessary, view.)
> There is no inherent incompatibility between theoretical
> generative linguistics and corpus linguistics, and
> by focussing on the enmity, many corpus linguists are
> making it impossible to discuss the real issues
> involved.
>
> Having said all that, I have very little information
> on Harris's approach to parsing and such things. Harris
> developed, in addition to his transformational analysis,
> something called string grammar, which was a
> non-transformational kind of analysis which encoded certain
> transformation-type information. It was much
> easier, in the early days of computational linguistics,
> to program string grammar than transformational grammar,
> so several of Harris's students adopted string grammar
> as the basis for parsers, information retrieval systems,
> etc. One such project was the String Project at New York
> University, directed by Naomi Sager. I believe it is still
> in operation. Another implementation was built by Aravind
> Joshi. I do not know specifically of any statistical
> parsers based on Harrisian transformational grammar,
> but parsing is not my field so there could well be some.
>
> Best wishes,
>
> Steve Seegmiller, Ph.D.
> Linguistics Department
> Montclair State University
> Upper Montclair, NJ 07043
>
> seegmillerm at alpha.montclair.edu
> http://www.chss.montclair.edu/linguistics/lingpage/faculty/seeg/seeg.htm