Corpora: Evidence and intuition

Patrick Hanks patrick at lingomotors.com
Wed Oct 31 23:32:10 UTC 2001


A late contribution to the discussion sparked by Sebastian Hoffmann:

I recently asked a few colleagues who are not corpus linguists to make
up a couple of natural sentences using the word "total" as a verb. The
answers typically fell into two classes:

	1. [[Driver]] total [[Vehicle]]
	   e.g. Carina totaled the car. 

	2. [[Person]] total [[Number]]
	   e.g. John totaled the column of figures. 

In the British and American corpora that we are currently using (in
particular BNC, Reuters, and 4 years of AP), sense 1 accounts for less
than 1% of uses of the verb, and sense 2 is even rarer: perfectly
plausible, but next to non-existent.

Over 98% of corpus uses of this verb fall into the following pattern:

	3. [[Entity (often plural)]] total [[Number | Amount]]
	   e.g. Sales totaled 6 million.

Why did this *very* common pattern of use not spring immediately to the
minds of ordinary native speakers of British or American English?
Hypotheses include:

	a) Introspection as a technique favors human subject roles.
	b) 3 is really a copula, "not a real verb".
	c) There is an inverse relationship between cognitive salience
	   and social salience.

Re 3, see (Hanks 1990), where I argued that people register the odd or
unusual and fail to register what we do regularly or continuously.
(Think of someone putting his/her hand on your arm. Now think of
someone having had his/her hand on your arm all afternoon.)

Whatever the reason, the phenomenon is a familiar one in lexical
analysis, first noticed by Cobuilders working on the Cobuild 7.3 million
word corpus in about 1983. Of course, 'total' is a fairly dramatic
example, but other less dramatic cases abound, e.g. the "delexical
verbs" (known in America as "light verbs"). Ask people to make up
examples for common uses of "take" and very few of them will think of
[[Duration]]:

	4. How long will it take?

	5. It only took a few minutes.

Interestingly, the phenomenon is occasionally denied by some theoretical
linguists and other intelligent people, corpus evidence to the contrary
notwithstanding. The opening shot is usually "Your corpus is not
representative" (?!). Why do they do this? Surely it cannot be as
simple as wishing to preserve introspection as a research technique?


Patrick


