Corpora: corpus/corpora and Harris/Chomsky

Sat Apr 7 00:49:28 UTC 2001

Two threads seem to have come together, but they may mirror each other
in a way, so I will address both.

1. "a corpora":

1.1 The objection to the "misuse" of plural form in singular
context in some emails (often by, I think it's fair to say, non-expert
speakers of English, or people just starting out in corpus linguistics)
diverted attention from the *content* of their emails to the *form*.
Which is a shame, because several of the emails were pleas for help,
to which I have not seen many replies...

1.2 Such "misuse" may also be due to carelessness rather than ignorance.
How carefully do we all edit our emails? Some obviously more than others.
If we spend too long editing, we lose the spontaneity; if we don't edit
at all, we make typos, overlook errors, etc. Diieferent strokes....
(see what I mean?).

1.3 Such "misuse" may be or may become evidence for language change.
For example, do you say "the data is" or "the data are"?
Historically and etymologically, the latter
is "correct" and the former is wrong (in Latin, datum=singular,
data=plural), yet current usage seems to be roughly equal.

1.4 The evidence from the Collins COBUILD Bank of English corpus at
Birmingham University, consisting of 418 million words of 1990s data, is:
"the data is" = 147 examples, attested in all subcorpora, and fairly
evenly spread
"the data are" = 159 examples, but attested only in some subcorpora, and
markedly frequent in US subcorpora and more formal BR subcorpora
(US academic textbooks, US newspapers, New Scientist, Economist)
[***full details can be supplied to any interested parties]

1.5 This reflects the fact that US academic writing adheres slightly more
to traditional uses, and that, as formerly technical terms move into
more mainstream use, their use often changes (many lay people won't know
about Latin declensions...)

1.6 The evidence for "bacteria", however, is:
"the bacteria is" = 8
"the bacteria are" = 41
which shows that the same process does not necessarily operate on all
Latinate forms to the same extent or at the same rate.
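
For anyone who wants the comparison in 1.4 and 1.6 as proportions rather
than raw counts, here is a minimal sketch. The four counts are the ones
quoted above; the names counts, singular_share and shares are just
illustrative choices of mine, and the percentages say nothing about
statistical significance.

```python
# Counts quoted above from the Bank of English (1990s data).
counts = {
    "data": {"is": 147, "are": 159},
    "bacteria": {"is": 8, "are": 41},
}


def singular_share(is_count, are_count):
    """Return the fraction of 'the X is/are' hits that use singular 'is'."""
    total = is_count + are_count
    return is_count / total if total else 0.0


# Share of singular agreement for each noun.
shares = {noun: singular_share(c["is"], c["are"]) for noun, c in counts.items()}

for noun, share in shares.items():
    total_hits = sum(counts[noun].values())
    print(f"the {noun} is: {share:.1%} of {total_hits} hits")
```

On these figures "the data is" accounts for roughly 48% of the 306 hits,
and "the bacteria is" for roughly 16% of the 49 hits.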

1.7 BTW, there were only two examples of "a data", both from
British spontaneous spoken data:

A: There's one piece of data I don't dispute that
B: Yeah. Right. But a data is not a fact.

C: ... my point of view is so maybe er you ... calculated and maybe you
... came across such a data which er give us possibility to compare the
losses...

1.8 Which reminds me that emails are a curious halfway-house between
spoken and written modes: spontaneous emails are closer to spoken,
and therefore more likely to contain more idiosyncratic uses.

1.9 "corpi" derives the plural form from the wrong Latin declensional
paradigm. But hey, we are participating in the creation of the English
of the future here, not correcting people's knowledge or ignorance of
Latin. If in 100 years' time, consensus use prefers "corpi", what's
the problem?

1.10 Finally, do we correct everyone who says "a graffiti", because
we happen to know that "graffiti" is plural in the original Italian,
and the singular should be "graffito"? Or should I correct everyone who
mispronounces my name (most non-Indians are actually incapable of
producing the correct pronunciation even when coached)?

2. The Chomsky/MIT/generative debate.

2.1 This seems to have provoked an equally strident debate, which
also reflects underlying "right/wrong" beliefs.

2.2 However, in between the polemic I have discerned several nuclei
(is that a "correct" usage?) of useful historical information,
interesting perspectives on the relationships between different
branches and traditions of linguistics, and quite a lot of humour!
:--)

2.3 IMHO, this is one of the best threads I have seen for a long time on
this list. I think people should have a chance to mouth off about their
pet hates, niggles, etc, as it stimulates others to think hard about
what the underlying "truths" are, which analogies work and which don't,
and maybe even to go away and read some of the literature that others
have recommended...

2.4 For a relatively young field (at least under the name of Corpus
Linguistics), I think this is very healthy. It has certainly given
me much food for thought (for example, "in what way does corpus linguistics
currently lack or ignore 'explanatory adequacy'?"), and added to the
backlog of "stuff I must get round to when I retire"; for both of which
I am truly thankful!

2.5 BTW, the notion of "transformational" and "generative" grammar
did not begin with Chomsky (although maybe the English terms did),
nor even in the current century. Panini's
grammar of Sanskrit (which in turn borrowed from many previous scholars'
work) embodies both principles, although it actually starts with a
"functional" basis. Very crudely: (a) "What do you want to express?", then
(b) "start with this form", and (c) depending on your specific contextual
requirements and preferences, which you process your way through rather
like a multiple-choice questionnaire, "execute these transformations on
that form". The entire description of the grammatical system strongly
resembles a computer program or flowchart, with structural features
such as "if...then", "go to", "do this repeatedly until...", etc.
To quote Goodness Gracious Me: "Grammar? Indian!".
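
To make the "computer program" analogy a little more concrete, here is a
very rough sketch of the general shape: start from a base form, then run
an ordered set of conditional rules over it repeatedly until nothing
changes. Please note that the rules below are toy English-like examples
I have invented for illustration. They are not taken from Panini's
grammar, and the names (derive, RULES, the "+" boundary symbol) are
purely hypothetical.

```python
# Toy illustration only: these rules are invented English-like examples,
# not rules taken from Panini's grammar.


def add_plural_marker(form, ctx):
    # Context-dependent choice: fires only if a plural is wanted
    # and the form has not been marked already.
    if ctx.get("number") == "plural" and not ctx.get("marked"):
        ctx["marked"] = True
        return form + "+s"
    return form


def insert_e_after_sibilant(form, ctx):
    # "if ... then": repair a junction created by the previous rule.
    for ending in ("s+s", "x+s", "ch+s", "sh+s"):
        if form.endswith(ending):
            return form[:-2] + "+es"
    return form


def erase_boundary(form, ctx):
    # Final clean-up: drop the internal boundary symbol.
    return form.replace("+", "")


RULES = [add_plural_marker, insert_e_after_sibilant, erase_boundary]


def derive(base, ctx):
    """Apply the ordered rules repeatedly until the form stops changing."""
    form = base
    ctx = dict(ctx)  # work on a copy so the caller's dict is untouched
    changed = True
    while changed:  # "do this repeatedly until..." no rule fires
        changed = False
        for rule in RULES:
            new_form = rule(form, ctx)
            if new_form != form:
                form, changed = new_form, True
    return form


print(derive("bus", {"number": "plural"}))
print(derive("cat", {"number": "plural"}))
```

With these toy rules, "bus" passes through "bus+s" and "bus+es" before
surfacing as "buses", while "cat" needs only the first and last rules.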

:--)

Ramesh Krishnamurthy
Honorary Research Fellow, Birmingham University
Honorary Research Fellow, Wolverhampton University
Consultant, Cobuild and Bank of English Corpus, Collins Dictionaries