Corpora: Corpus Linguistics

Mon Apr 9 00:48:45 UTC 2001

Christopher Bader said:
>> 2.  In his more recent work, Chomsky distinguishes between
>> the E-language (e.g. the set of all grammatical sentences)
>> and the I-language (the human language faculty).  Generative
>> grammarians study the latter; corpus linguists, the former.
>> The Chomsky Hierarchy and Chomsky Normal Form are
>> of course concepts pertaining to the E-language, not to
>> the I-language, which is why Chomsky no longer works
>> in this area.

Tony McEnery commented:
> I see no problem with the above statement, other than to say that at
> times Linguistics has excluded the study of E-language (in the sense of
> attested language use as opposed to the concoction of invented examples)
> as being part of linguistics proper.

Ramesh comments:
I suggest that Tony has not gone far enough in just modifying the definition
of E-language ("in the sense of attested language use").
Even to accept that "corpus linguists study E-language" (in any sense) is
to describe apples in terms of oranges. Why use Chomskyan terms at all?

a) If E-language is the set of *all* grammatical sentences, surely not even
the most idealistic corpus linguist would claim that this is what they are
currently studying, or even hoping to study in some distant future. The
billion-word corpora that are just around the corner (and allowing for
even further massive leaps in corpus size beyond that) will still represent
only a small sample of any language community's total discourse. All that
corpus linguists will ever be able to say is that certain linguistic
features and patterns are well attested in a particular corpus, and others
are rare or marked/constrained in some way. Models extrapolated from the
well attested features are likely to be fairly robust, but will always
leave some new input data categorially indeterminate between error,
creativity, local usage, humour, and so on.

b) If E-language is the set of all *grammatical* sentences, again I would
doubt that corpus linguists would say that this is what they are studying.
Grammaticality is only one criterion in the evaluation of a corpus instance.
"Ungrammatical" instances are a valid part of a descriptive language model,
and may be accounted for in various ways, for example by reference to
real-time interactional factors, pragmatics, sociolinguistics, pathology,
or other extra-linguistic factors. On the other hand, many grammatically
possible, invented examples may remain unsubstantiated by corpus attestation
for a very long time. My own research into countless invented dictionary
examples shows that many are absent or extremely rare in the 418-million-word
Bank of English corpus, for example. Texts produced for purposes other than
linguistic exemplification or disputation exhibit a quality that John
Sinclair has termed "naturalness", which appears to be beyond capture by
grammaticality alone.

Instead of adopting a Pythonesque "What has Chomsky ever done for
linguistics/usable language technology/etc.?" stance, and then having to
refute the various suggestions made, or having to concede gradually
"apart from X, Y, and Z", corpus linguists
would do well to continue to plough their own furrow. A bottom-up methodology
will inevitably take longer to arrive at the higher reaches of observational
adequacy, let alone to satisfy any other adequacies (if that is what our
aim is...). And of course, language is dynamic, so not only the descriptive
model, but even the notion of observational adequacy, will have to be
dynamic as well...

Ramesh Krishnamurthy
COBUILD/University of Birmingham/Collins Dictionaries