Corpora: Chomsky and corpus linguistics

Mike Maxwell mike_maxwell at sil.org
Thu Apr 26 13:53:11 UTC 2001


Terry Murphy, quoting Robert De Beaugrande:

>...the corpus highlights the improbable and unnatural
>quality of invented data like 'John is eager to please'.

Concerning the 'improbable' (and therefore rare) nature of certain data
which has been used to argue for certain generative accounts: This is
precisely the generativists' point.  If some large group of people all have
the same judgement about the acceptability of certain constructions, and
those constructions are rare, then how can one explain their consensus?  A
case in point is parasitic gaps.  I don't know for sure, but I would guess
that they are vanishingly rare in corpora, and in the sort of input that
children get.  And yet the first time I heard constructed examples of
parasitic gaps, I, and the other linguists who were hearing the report,
immediately reacted the same way: they were "good English."  It seems to me
that there is a datum that needs explaining: you've never (or almost never)
seen something before, but it is immediately familiar.  Group deja vu.

I hasten to add (as I have said before) that some generativists have
certainly made questionable grammaticality judgements.  Simply put, there
can be bad data in acceptability/grammaticality judgements.  But this
problem is not limited to acceptability/grammaticality judgements; in fact,
all sciences have to deal with irreproducible data.  (And corpora have lots
of it, IMHO.)

Concerning the 'unnatural' quality of certain invented data, I guess I'm
just not sure what the problem is, or even what definition of '(un)natural'
is being used here.  Is it "unnatural" just because it didn't occur in a
corpus, or in natural conversation?  Or is it "unnatural" in the sense that
it isn't "real English" (or other language)?  If the former, that seems an
odd definition of "natural" (on a par with claiming that synthesized organic
chemicals, say, are not really organic); if the latter, what is the evidence
for the claim?  (Or maybe there's another definition of "natural" here.)

Dr. Murphy himself:
>Chomsky's comment about corpus linguistics not
>existing seems to be a logical response from
>someone whose whole enterprise would be
>undermined by the widespread adoption of real
>data as a mediator of conflicting linguistic judgements.

I doubt whether Chomsky is the least little bit worried about his enterprise
being undermined by corpus work.  But I question the phrase "real data"
here: there is nothing artificial, I claim, about introspective judgements;
and the fact that the data in a corpus wasn't produced for purposes of doing
linguistics does not in itself make it _better_ for doing linguistics.  It
may, in some circumstances, make it worse--circumstances like slips at the
keyboard, people writing at one in the morning or half drunk, non-native
speakers, etc.  Maybe there is a theory that explains the kinds of
differences created by these circumstances, and sometimes that theory will
even involve linguistics (e.g. work that's been done on slips of the
tongue), but IMHO it's wrong to say that linguistics has to explain the
output from all such circumstances, or that that sort of data is necessarily
better than introspective data.

                                         Mike Maxwell
                                         Summer Institute of Linguistics
                                         Mike_Maxwell at sil.org


