Corpora: summary: corpus evidence that runs counter to intuition?

Wed Oct 24 12:02:29 UTC 2001

Dear corpus list subscribers

I'd like to thank all those people who have sent replies to my
question about corpus evidence that runs counter to intuition. A
number of people have asked me to write a summary - so here's what I
received (in chronological order):

Yorick Wilks mentioned that Paul Jacobs (in his information
extraction presentation) used to point out that, in English,
'television' nearly always means 'the medium' in corpus counts and
not 'the TV set', which is what most of us would (unreflectingly)
think.

Eric Atwell wrote that many people (probably especially
non-linguists) are surprised by the high frequency of the
four-letter word fuck in the spoken part of the BNC. He thinks that
this is "evidence that it is far more natural and normal in ordinary
speech than many expect".

I had a look in the BNC and here's some data:
In its various forms, fuck occurs 2,814 times in the spoken part of
BNC World Edition (272 instances pmw). It is uttered almost three
times as often by male speakers as by female speakers (359pmw vs.
137pmw). So yes, it IS frequent. ;-) But it may also be important to
note that 985 of the spoken instances occur in one single file (KDA)
which is a collection of conversations between aircraft engineers. In
this file, the frequency is a whopping 13,000 instances pmw! Other
files with high numbers of (fuck|fucks|fucking|fucked) are KD9 (110
instances, 7909pmw), KE1 (141 instances, 6713pmw), KDN (231
instances, 4986pmw), KCU (164 instances, 3045pmw), KP4 (105
instances, 2547pmw), G01 (104 instances, 2548pmw), and FP6 (100
instances, 2520pmw). Thus, these 8 files alone cover 1,940 - or 69 per
cent - of all relevant instances.
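
The arithmetic behind these figures can be checked with a few lines of
Python. This is only a minimal sketch: the counts are the ones quoted
above, while the helper names (per_million, implied_tokens) are my own,
and the file sizes are not given in this message, so any size derived
from a pmw figure is an estimate and not a value taken from the BNC.

```python
# Per-file counts of (fuck|fucks|fucking|fucked) as quoted in the message.
file_counts = {
    "KDA": 985, "KD9": 110, "KE1": 141, "KDN": 231,
    "KCU": 164, "KP4": 105, "G01": 104, "FP6": 100,
}
total_spoken = 2814  # all instances in the spoken part, as quoted


def per_million(count, tokens):
    """Normalise a raw count to instances per million words (pmw)."""
    return count / tokens * 1_000_000


def implied_tokens(count, pmw):
    """Estimate a text's size in words from a count and its pmw rate.

    Assumption: this simply inverts per_million; the result is only as
    precise as the rounded pmw figure it is derived from.
    """
    return count / pmw * 1_000_000


top_total = sum(file_counts.values())   # instances in the 8 files
share = top_total / total_spoken        # proportion of all instances

print(f"{top_total} instances = {share:.0%} of {total_spoken}")
```

The same two helpers can be applied to any of the other rates in this
summary, as long as the size of the relevant (sub)corpus is known.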

Jasper Holmes notes that intuition often fails in connection with
well-known 'grammar errors'. For example, around half of the cases
of NONE, NEITHER, NO+np (as subject of a present tense verb) in the
ICE-GB corpus appeared with plural verbal agreement (your grammar
book will tell you they are singular). Similarly for examples like _a
bag of letters_.

John McKenny wrote:
What springs to mind for me is the use of 'would' to talk about past
habits, e.g. "when I was young I would go to Mass each morning with
my grandmother". Before the advent of COBUILD this was considered to
be literary and less colloquial than "used to". This was my intuition
and the general intuition of the EFL community, as witnessed by
countless ELT textbooks and grammars. I taught "used to" to countless
pre-intermediate students, leaving "would" for advanced students.
COBUILD turned this upside down, I think.

Philip Resnik pointed out that "Talke Macfarland has done some very
interesting corpus-based work on passive cognate object
constructions, showing that corpus evidence contradicts some
introspection-based claims in the literature about grammaticality."

John Williams wrote that "In the Bank of English, by far the most
frequent meaning of 'bash' (any part of speech) is 'party' whereas I
think most native speakers would intuitively go for 'hit, beat up'
(informal). This could be explained by the large news media component
of the BofE ('bash = party' is very much a 'media' word) or maybe
it's 'really' the most frequent meaning (whatever that means)."

Guy Aston replied to this statement and pointed out that the BNC
data do not support it. He writes:
On a rough count, out of 272 occurrences of "bash", 97 are verbs
meaning "hit" and 19 are forms of the delexicalised "have a bash";
there are also 6 other nominal uses meaning "a hit". 59 are proper
nouns (characters called "Bash"), leaving only 80ish as nouns meaning
"party". And then the verb bash also has other forms ...

John Williams mentioned two further points:
The large news component [in the Bank of English] also explains
things like the main verb collocates of 'radio station' being things
like 'capture' or 'take over', rather than the more intuitive 'listen
to' or 'tune into'.
And also there are the well-known cases like 'give', where the
delexicalized meanings ('give a smile', etc) are more frequent than
'hand over, present'; and 'see = understand' which is more frequent
than 'see = perceive with eyes'.

Again, I checked the BNC and looked for verb collocates of radio
station and radio stations (which together occur 509 times in the
BNC) within a window of -3 to +3. The result is ranked by
log-likelihood value and the lemmatization is based on the Lancaster
scheme provided with the BNC World Edition. Only node-collocate pairs
which occurred at least 3 times were considered for the calculation.
Sorry for the formatting - hope you can make sense of this table.

------------------------------------------------------------------------------
No.  Lemma                 n  n coll.  n texts  log-likelihood value
------------------------------------------------------------------------------
  1  be_VERB         3244400       64       43  93.790871
  2  broadcast_VERB      970        7        6  89.244032
  3  occupy_VERB        4379        6        6  56.542749
  4  have_VERB       1319155       30       23  49.442746
  5  own_VERB           6372        5        5  41.556727
  6  play_VERB         37632        6        5  31.023832
  7  report_VERB       18747        5        5  30.875074
  8  run_VERB          39201        6        6  30.547817
  9  use_VERB         105881        8        5  29.948009
 10  seize_VERB         2505        3        3  27.448364
 11  say_VERB         318281       11       10  25.577838
 12  establish_VERB    17397        4        4  23.526801
 13  was_VERB         883602       16       14  20.391751
 14  operate_VERB      10179        3        3  19.103241
 15  hear_VERB         34747        3        2  11.959554
 16  take_VERB        173956        5        5  10.003759
 17  call_VERB         52265        3        3   9.669488
 18  get_VERB         213722        5        5   8.305941
 19  could_VERB       160161        4        4   7.064107
 20  know_VERB        178522        4        4   6.362790
 21  give_VERB        125302        3        3   5.087929
 22  go_VERB          227069        4        4   4.879994
 23  do_VERB          538558        6        6   3.632087
 24  will_VERB        329392        4        4   2.835328
------------------------------------------------------------------------------

Since my calculation is based on single-word collocates, I cannot
give any information about "the more intuitive [verb - preposition
combinations] 'listen to' or 'tune into'" - but in any case, the
verbs listen and tune are not found in the above table... ;-)
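
For readers who have not met the log-likelihood measure used for the
ranking above, here is a minimal Python sketch of the standard
log-likelihood ratio (G2) on a 2x2 contingency table. It is the
textbook formula, not necessarily the exact procedure behind the table:
the function name is my own, the corpus size of 100 million words is a
round-number assumption, and the sketch makes no adjustment for the
-3 to +3 window, so it will not reproduce the values above exactly.

```python
import math


def log_likelihood(cooc, node_freq, coll_freq, corpus_size):
    """Log-likelihood ratio (G2) for a node-collocate pair.

    cooc        -- number of co-occurrences of node and collocate
    node_freq   -- frequency of the node (e.g. radio station(s))
    coll_freq   -- corpus frequency of the collocate lemma
    corpus_size -- total number of tokens (an assumption here)
    """
    # Observed 2x2 table: node/not node by collocate/not collocate.
    observed = [
        cooc,
        node_freq - cooc,
        coll_freq - cooc,
        corpus_size - node_freq - coll_freq + cooc,
    ]
    rows = [node_freq, corpus_size - node_freq]
    cols = [coll_freq, corpus_size - coll_freq]
    # Expected counts under independence: row total * column total / N.
    expected = [r * c / corpus_size for r in rows for c in cols]
    # Empty cells contribute nothing to the sum (0 * log 0 is taken as 0).
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)


ASSUMED_CORPUS_SIZE = 100_000_000  # assumption, not taken from the message

# Two rows from the table: (co-occurrences, lemma frequency).
pairs = {"broadcast_VERB": (7, 970), "will_VERB": (4, 329392)}
for lemma, (cooc, freq) in pairs.items():
    score = log_likelihood(cooc, 509, freq, ASSUMED_CORPUS_SIZE)
    print(lemma, round(score, 2))
```

The point of the measure is visible even in this rough form: a rare verb
that co-occurs with the node a handful of times scores far higher than a
very frequent verb with a similar number of co-occurrences.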

Bob Krovetz wrote:
In my work on morphology I would sometimes come across examples that
made sense, but that I wouldn't have thought of beforehand. I studied
corpus data in order to decide which morphological variants I should
reduce to a root for purposes of information retrieval (this is
called "stemming"). I tried to avoid any groupings that would create
ambiguity. For example, I didn't reduce "gravitation" to "gravity"
because "gravity" can also mean "serious" (the gravity of the crime),
which is the predominant meaning of "gravity" in legal text. So
should "accelerators" be reduced to "accelerator"? I found that
"accelerator" refers to either a car accelerator or a nuclear
particle accelerator in newspaper text. But "accelerators" referred
only to nuclear particle accelerators. We just don't talk about
more than one car accelerator. It's possible to do so, but very
unlikely (at least within newspaper text). I'm not saying that
"accelerator" is limited to those two meanings either - those were
just the ones I found in the corpora I studied.
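
The idea of holding back a reduction when it would merge different
senses can be illustrated with a toy Python sketch. It is purely
hypothetical and is not Bob Krovetz's stemmer: the lookup table, the
exception set and the function name are invented for illustration. Only
the "gravitation" case is taken from his message; the "accelerators"
entry is left reducible here because the message poses it as an open
question.

```python
# Hypothetical variant -> root table (illustration only).
VARIANT_TO_ROOT = {
    "gravitation": "gravity",
    "accelerators": "accelerator",
    "engineers": "engineer",
}

# Variants deliberately left unreduced because the root is ambiguous.
# "gravitation" is the example given in the message above.
DO_NOT_CONFLATE = {"gravitation"}


def conservative_stem(word):
    """Reduce a word to its root unless that would create ambiguity."""
    w = word.lower()
    if w in DO_NOT_CONFLATE:
        return w                      # keep the variant as its own term
    return VARIANT_TO_ROOT.get(w, w)  # unknown words are returned as-is


for token in ["Gravitation", "engineers", "accelerators", "gravity"]:
    print(token, "->", conservative_stem(token))
```

Whether "accelerators" belongs in the exception set is exactly the kind
of decision that, according to the message, has to be made by looking
at corpus data.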

I'm still looking for more examples - please keep them coming and
I'll post a second summary...

Sebastian Hoffmann
University of Zurich