Corpora: Robert De Beaugrande, Chomsky, and corpus linguistics

ramesh at clg.bham.ac.uk
Thu Apr 26 13:56:09 UTC 2001


I completely agree with Robert de Beaugrande's paragraph 65:
"65. As a corpus gets larger, it does not simply show us the same data
multiplied out, eg., each item being ten times as frequent in a corpus
ten times as large. Instead, the larger corpus both turns up fresh data
that did not appear at all in the smaller ones and displays the previous
data in steadily finer delicacy for the range and frequency of the
combinations. Hosts of regularities emerge that escaped notice in smaller
data sets, and would elude unguided intuition and introspection. [...]
Instead of coverage, convergence, and consensus decreasing when natural
language data get rewritten into a formal notation, they are now increasing
when data get treated in their naturally occurring formats."

I only partly agree with paragraph 66:
"66. Conversely, the corpus highlights the improbable and unnatural
quality of invented data like 'John is eager to please'. Typical contexts
of real discourse call for less simple-minded and peremptory utterances.
For example, all three instances of 'eager to please' in the Bank of
English have a Direct Object Target and a more interesting Subject Agent
than the legendary 'John'. eg., the 'government' keen to 'please' powerful
forces such as 'wealth' and 'the Church'
[18] <a government offical who is eager to please the wealth goddess>
[19] <the Sandinstas. The government is eager to please the church>"

The general point "the corpus highlights the improbable and unnatural
quality of invented data" is certainly valid. I have found many invented
examples in dictionaries, other language reference books, and linguistics
textbooks which simply are not reflected in corpus data (to quote just a
couple of examples:
"Don't hold the gun by the business end" appears in an EFL dictionary,
but there is only one example of "by the business end" in the 418
million word Bank of English corpus:
 You know, the sort produced by the business end of cows.
Out of 151 examples for "the business end", 45 are for "at the business
end", of which 32 are for "at the business end of"; 14 examples of
"on the business end" of which 13 are for "on the business end of";
10 for "with the business end" of which 9 are "with the business end of";
etc. The point is that 102 of the 151 examples are in a prepositional
phrase, so the *colligation* PREP+the+business+end is well attested, but
*not the collocation* with the lexical item "by" representing the
class PREP. More importantly, 115 of the 151 examples are followed
by "of", which is absent from the dictionary example. So the colligation
with a following "of" has escaped notice when intuition alone was relied on.

Of course, another problem with "Don't hold the gun by the business end"
is its limited contextualizability: how many of us would ever utter such
a sentence (a parent to a child in a lax-gun-law state, a training officer
in a police academy/the army?) and wouldn't we be more emphatic (e.g.
Don't *ever* hold the gun by the business end)?

"The plane overshot the runway" is another dictionary example, but
in such a truncated form, it omits the fatal real-world consequences...
"The arrow/missile overshot the target" rightly introduces the
collocate "target", but most modern corpus examples are for governments
and other organizations overshooting *financial targets*...)
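
For anyone who wants to reproduce counts like the ones given above for
"the business end" on their own data, here is a minimal sketch in Python.
It is not the Bank of English query software: the function name
profile_phrase, the closed list of prepositions, and all but the first of
the sample lines are my own assumptions for illustration. A serious study
would use part-of-speech tags rather than a word list to identify the
class PREP.

```python
import re
from collections import Counter

# Hypothetical closed list standing in for the class PREP;
# a tagged corpus would identify prepositions far more reliably.
PREPOSITIONS = {"at", "on", "with", "by", "from", "to", "in", "near"}

# Capture the word before and the word after "the business end", if any.
PHRASE = re.compile(
    r"(?:\b(\w+)\s+)?\bthe\s+business\s+end\b(?:\s+(\w+))?",
    re.IGNORECASE,
)

def profile_phrase(lines):
    """Return (total hits, Counter of preceding prepositions, hits followed by 'of')."""
    total = 0
    prep_counts = Counter()
    followed_by_of = 0
    for line in lines:
        for match in PHRASE.finditer(line):
            total += 1
            before = (match.group(1) or "").lower()
            after = (match.group(2) or "").lower()
            if before in PREPOSITIONS:
                prep_counts[before] += 1
            if after == "of":
                followed_by_of += 1
    return total, prep_counts, followed_by_of

# The first line is the corpus example quoted in this post;
# the other three are invented purely to exercise the code.
sample_lines = [
    "You know, the sort produced by the business end of cows.",
    "He found himself at the business end of a shotgun.",
    "She was on the business end of the deal.",
    "The business end is the sharp part.",
]

total, prep_counts, followed_by_of = profile_phrase(sample_lines)
print(total, sum(prep_counts.values()), followed_by_of)
print(prep_counts.most_common())
```

Summing the preposition counter gives the colligation figure
(PREP+the+business+end), the individual counter entries give the
collocation figures for each preposition, and the last number is the
count of hits followed by "of".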

Unfortunately, Robert de Beaugrande's Chomskyan example "John is eager
to please" is in fact well attested in the current Bank of English corpus,
which actually adds even more substance to his point in paragraph 65
about larger corpora. He was evidently using a much earlier - and smaller -
version of the Bank of English, if it only had 3 examples of "eager to
please". I have just checked in the 418 million word Bank of English,
and there are 168 examples of "eager to please". 115 of the 168 are
for the predicative use ("X is eager to please") or appositional use
("X, eager to please, is/does something", or sometimes sentence-initial,
"Eager to please, X is/does something"). On a more delicate level (again
supporting R de B's para 65), proper names (like "John") are much rarer
than personal pronouns, and the phrase is often part of a list of
attributes, e.g. "All were friendly, helpful and eager to please." Many
examples have adverbial modifiers, e.g. "desperately eager to please",
"touchingly eager to please", or just simple grading adverbs like "so",
"very", "too". Another 23 of the 168 examples are for attributive use
("the/an eager-to-please X"), usually hyphenated, describing people or
their face/expression/behaviour/attitude, etc., with the object not
mentioned. Only 31 out of 168 examples specify the object,
i.e. who the people are trying to please, in examples such as
"HE was eager to please manager Vialli" or
"a candidate eager to please all sides".
The evidence therefore suggests that "eager to please" is becoming
a fixed phrase, and the direct object of the verb "please" is not
usually mentioned (although in some contexts it may be implied,
or picked up in a looser contextual relationship in a subsequent
sentence). When the object is specified, the phrase seems to lose
its feeling of fixity (an illustration of the ability of speakers
to oscillate between the "idiom principle" and "open choice principle"
as outlined by John Sinclair some years ago).

Terry Murphy is right to suggest that
"Chomsky's comment about corpus linguistics not existing seems
to be a logical response from someone whose whole enterprise would be
undermined by the widespread adoption of real data"
but I am not sure whether his description of the function of
corpus data
"as a mediator of conflicting linguistic judgements"
is adequate or sufficient.
Corpus data is certainly essential for an accurate description of
language.

Ramesh Krishnamurthy
Consultant, Collins Dictionaries and Bank of English corpus
Honorary Research Fellow, Corpus Linguistics, University of Birmingham


