[Corpora-List] Chomsky and computational linguistics

Wed Jul 11 08:21:12 UTC 2007

Mike, I pretty much agree with you, but I think it's worth reflecting on 
the proposition that not all data is alike, and some is more equal than 
others. In the case of generative syntax, as it's been done over the 
last nearly 50 years, most of the data that's been used, I would guess, 
has been about phenomena where there would have been no problem in 
finding examples in real corpora, and a search for examples in real 
corpora might have suggested new insights. In a relatively small number 
of areas---I label these "the parasitic gaps phenomena" in my 
mind---examples are so rare that corpus search may give one too poor a 
harvest, so if one thinks of these phenomena as having an importance 
that offsets their rarity in data, one is forced to go the intuition 
route.

What would you say to the following credo for syntacticians: using one's 
own native-speaker intuitions is a useful heuristic for developing 
hypotheses, but these judgments should be used only as a last resort as 
the basis for a serious argument (in a publication, thesis, etc.). Since 
science consists both of finding elegant generalizations and showing 
that they correspond well to the facts rather than what we would like 
the facts to be, the linguist is obliged to submit his/her theories to 
the best tests currently available. In the relatively rare case where 
the issue generates only hypotheses that are difficult to test in data 
that has a reality beyond the linguistic imagination of the linguist, 
that lack of empirical grounding should be established and acknowledged.

I think there are serious objections to this with regard to phonology 
and morphology, so I expect some with regard to syntax as well. But I 
think some good arguments can be made for the position I sketched --

best,
John Goldsmith

Mike Maxwell wrote:
> Philip Resnik wrote:
>> ...there's an emerging community trying to use the
>> tools of computational linguistics with corpora to enable linguistic
>> theoreticians to be more empirical in their approach without
>> abandoning their paradigm.    ...
>> Work of this kind offers value to theoretical linguists because the
>> standard paradigm of inventing and judging examples can easily miss
>> relevant facts, and lead to false generalizations.
>
> If I were doing theoretical syntax, I think this is exactly where I'd 
> find myself: using corpora to keep me from missing relevant facts, 
> that is, examples from people whose judgements might differ from my 
> own (like my spell checker differs from me about how 'judgement' 
> should be spelled, but that's another question...), or making me 
> consider constructions that I might otherwise have overlooked.
>
> There is a danger in corpus linguistics of conflating dialects, using 
> texts produced by non-native speakers, etc.  And crucial constructions 
> may simply be so rare that you just can't find them in the available 
> corpus.  (Crucial examples of certain kinds of reduplication in 
> "exotic" languages are an example of hard-to-find data in a corpus, 
> and that's not even syntax.)  So there's still room, it seems to me, 
> for introspection (or asking the person in the next office, or 
> elicitation from an informant).  But as you say, the corpora add 
> value, too.
>
> (Other kinds of linguistics, like lexicography, have been about corpus 
> collection for centuries, of course.)