Corpora: summary: corpus evidence that runs counter to intuition?
Sebastian Hoffmann
sebhoff at es.unizh.ch
Wed Oct 24 12:02:29 UTC 2001
Dear corpus list subscribers
I'd like to thank all those people who have sent replies to my
question about corpus evidence that runs counter to intuition. A
number of people have asked me to write a summary - so here's what I
received (in chronological order):
Yorick Wilks mentioned that Paul Jacobs (in his information
extraction presentation) used to point out that, in English,
'television' nearly always means 'the medium' in corpus counts and
not 'the TV set' which is what most of us would (unreflectingly)
think.
Eric Atwell wrote that many people (probably especially
non-linguists) are surprised by the high frequencies of the
four-letter word fuck in the spoken part of the BNC and thinks that
this is "evidence that it is far more natural and normal in ordinary
speech than many expect".
I had a look in the BNC and here's some data:
In its various forms, fuck occurs 2,814 times in the spoken part of
BNC World Edition (272 instances pmw). It is uttered almost three
times as often by male speakers as by female speakers (359pmw vs.
137pmw). So yes, it IS frequent. ;-) But it may also be important to
note that 985 of the spoken instances occur in one single file (KDA),
which is a collection of conversations between aircraft engineers. In
this file, the frequency is a whopping 13000 instances pmw! Other
files with high numbers of (fuck|fucks|fucking|fucked) are KD9 (110
instances, 7909pmw), KE1 (141 instances, 6713pmw), KDN (231
instances, 4986pmw), KCU (164 instances, 3045pmw), KP4 (105
instances, 2547pmw), G01 (104 instances, 2548pmw), and FP6 (100
instances, 2520pmw). Thus, these 8 files alone cover 1940 - or 69 per
cent - of all relevant instances.
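The arithmetic behind these figures can be checked with a minimal sketch. The counts and rates are the ones quoted above; the helper names (per_million, implied_tokens) are made up for illustration, and the implied size of the spoken component is only an estimate obtained by inverting the pmw formula.

```python
# Per-file counts of (fuck|fucks|fucking|fucked), as quoted in the message.
file_counts = {
    "KDA": 985, "KD9": 110, "KE1": 141, "KDN": 231,
    "KCU": 164, "KP4": 105, "G01": 104, "FP6": 100,
}
TOTAL_SPOKEN_HITS = 2814  # all spoken instances, from the message
SPOKEN_PMW = 272          # overall spoken rate per million words, from the message


def per_million(hits: int, tokens: float) -> float:
    """Normalise a raw count to instances per million words (pmw)."""
    return hits / tokens * 1_000_000


def implied_tokens(hits: int, pmw: float) -> float:
    """Invert the pmw formula to estimate the word count behind a rate."""
    return hits / pmw * 1_000_000


top_hits = sum(file_counts.values())
share = top_hits / TOTAL_SPOKEN_HITS
spoken_tokens = implied_tokens(TOTAL_SPOKEN_HITS, SPOKEN_PMW)

print(top_hits)                       # 1940
print(round(share * 100))             # 69
print(round(spoken_tokens / 1e6, 1))  # 10.3 (million words, implied)
```

The same inversion applied to a single file (for example 110 hits at 7909pmw) gives a rough idea of how small some of these files are, which is why a handful of speakers can dominate the overall rate.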
Jasper Holmes notes that intuition often fails in connection with
well known 'grammar errors'. For example, around half of the cases
of NONE, NEITHER, NO+np (as subject of a present tense verb) in the
ICE-GB corpus appeared with plural verbal agreement (your grammar
book will tell you they are singular). Similarly for examples like _a
bag of letters_.
John McKenny wrote:
What springs to mind for me is the use of 'would' to talk about past
habits. e.g. "when I was young I would go to Mass each morning with
my grandmother". Before the advent of COBUILD this was considered to
be literary and less colloquial than "used to". This was my intuition
and the general intuition of the EFL community (witness countless ELT
textbooks and grammars). I taught "used to" to countless
pre-intermediate students leaving "would" for advanced students.
COBUILD turned this upside down, I think.
Philip Resnik pointed out that "Talke Macfarland has done some very
interesting corpus-based work on passive cognate object
constructions, showing that corpus evidence contradicts some
introspection-based claims in the literature about grammaticality."
John Williams wrote that "In the Bank of English, by far the most
frequent meaning of 'bash' (any part of speech) is 'party' whereas I
think most native speakers would intuitively go for 'hit, beat up'
(informal). This could be explained by the large news media component
of the BofE ('bash = party' is very much a 'media' word) or maybe
it's 'really' the most frequent meaning (whatever that means)."
Guy Aston replied to this statement and pointed out that the BNC
data do not support this finding. He writes:
On a rough count, out of 272 occurrences of "bash", 97 are verbs
meaning "hit" and 19 are forms of the delexicalised "have a bash",
along with 6 other nominal uses meaning "a hit". 59 are proper nouns
(characters called "Bash"), leaving only 80ish as nouns meaning
"party". And then the verb bash also has other forms ...
John Williams mentioned two further points:
The large news component [in the Bank of English] also explains
things like the main verb collocates of 'radio station' being things
like 'capture' or 'take over', rather than the more intuitive 'listen
to' or 'tune into'.
And also there are the well-known cases like 'give', where the
delexicalized meanings ('give a smile', etc) are more frequent than
'hand over, present'; and 'see = understand' which is more frequent
than 'see = perceive with eyes'.
Again, I checked the BNC and looked for verb collocates of radio
station and radio stations (which together occur 509 times in the
BNC) within a window of -3 to +3. The result is ranked by
log-likelihood value and the lemmatization is based on the Lancaster
scheme provided with the BNC World Edition. Only node-collocate pairs
which occurred at least 3 times were considered for the calculation.
Sorry for the formatting - hope you can make sense of this table.
--------------------------------------------------------------------
No.  Lemma                 n  n coll.  n texts  log-likelihood value
--------------------------------------------------------------------
  1  be_VERB         3244400       64       43             93.790871
  2  broadcast_VERB      970        7        6             89.244032
  3  occupy_VERB        4379        6        6             56.542749
  4  have_VERB       1319155       30       23             49.442746
  5  own_VERB           6372        5        5             41.556727
  6  play_VERB         37632        6        5             31.023832
  7  report_VERB       18747        5        5             30.875074
  8  run_VERB          39201        6        6             30.547817
  9  use_VERB         105881        8        5             29.948009
 10  seize_VERB         2505        3        3             27.448364
 11  say_VERB         318281       11       10             25.577838
 12  establish_VERB    17397        4        4             23.526801
 13  was_VERB         883602       16       14             20.391751
 14  operate_VERB      10179        3        3             19.103241
 15  hear_VERB         34747        3        2             11.959554
 16  take_VERB        173956        5        5             10.003759
 17  call_VERB         52265        3        3              9.669488
 18  get_VERB         213722        5        5              8.305941
 19  could_VERB       160161        4        4              7.064107
 20  know_VERB        178522        4        4              6.362790
 21  give_VERB        125302        3        3              5.087929
 22  go_VERB          227069        4        4              4.879994
 23  do_VERB          538558        6        6              3.632087
 24  will_VERB        329392        4        4              2.835328
--------------------------------------------------------------------
Since my calculation is based on single word collocates, I cannot
give any information about "the more intuitive [verb - preposition
combinations] 'listen to' or 'tune into'" - but in any case, the
verbs listen and tune are not found in the above table... ;-)
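For readers who want to see how a log-likelihood ranking of this kind is typically computed, here is a minimal sketch of the standard G2 statistic on a 2x2 contingency table. It is not necessarily the exact procedure behind the table above: the corpus size (ASSUMED_CORPUS_SIZE) is a round-number assumption, and treating the -3 to +3 window as 6 token slots per node occurrence is one possible convention among several, so the scores it produces should not be expected to match the table's values.

```python
import math

NODE_FREQ = 509                    # "radio station(s)" in the BNC, from the message
ASSUMED_CORPUS_SIZE = 100_000_000  # assumption: round figure, not from the message


def log_likelihood(o11: float, o12: float, o21: float, o22: float) -> float:
    """G2 statistic for a 2x2 table of observed frequencies."""
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    observed = ((o11, o12), (o21, o22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2 * total


def collocation_ll(cooc: int, collocate_freq: int,
                   node_freq: int = NODE_FREQ,
                   corpus_size: int = ASSUMED_CORPUS_SIZE,
                   window: int = 6) -> float:
    """Score a collocate from its window count and its overall frequency."""
    window_tokens = node_freq * window  # token slots inside all windows
    o11 = cooc                          # collocate inside a window
    o12 = collocate_freq - cooc         # collocate elsewhere in the corpus
    o21 = window_tokens - cooc          # other tokens inside a window
    o22 = corpus_size - collocate_freq - window_tokens + cooc
    return log_likelihood(o11, o12, o21, o22)


# Frequencies for two rows of the table (lemma frequency, co-occurrences).
print(collocation_ll(cooc=7, collocate_freq=970))     # broadcast_VERB
print(collocation_ll(cooc=4, collocate_freq=329392))  # will_VERB
```

Note that G2 measures departure from independence in either direction, so a very frequent verb that turns up less often than expected near the node can still receive a positive score.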
Bob Krovetz wrote:
In my work on morphology I would sometimes come across examples that
made sense, but I wouldn't have thought of them beforehand. I studied
corpus data in order to decide which morphological variants I should
reduce to a root for purposes of information retrieval (this is
called "stemming"). I tried to avoid any groupings that would create
ambiguity. For example, I didn't reduce "gravitation" to "gravity"
because "gravity" can also mean "serious" (the gravity of the crime),
which is the predominant meaning of "gravity" in legal text. So
should "accelerators" be reduced to "accelerator"? I found that
"accelerator" refers to either a car accelerator or a nuclear
particle accelerator in newspaper text. But "accelerators" referred
only to nuclear particle accelerators. We just don't talk about
more than one car accelerator. It's possible to do so, but very
unlikely (at least within newspaper text). I'm not saying that
"accelerator" is limited to those two meanings either - those were
just the ones I found in the corpora I studied.
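The idea of holding back reductions that would create ambiguity can be illustrated with a toy sketch. This is not Bob Krovetz's stemmer: the candidate table and the exception list (CANDIDATE_ROOTS, DO_NOT_CONFLATE) are hypothetical and only contain the examples mentioned above, plus one unproblematic plural.

```python
# Hypothetical candidate reductions that a stemmer might propose.
CANDIDATE_ROOTS = {
    "gravitation": "gravity",
    "accelerators": "accelerator",
    "engineers": "engineer",
}

# Hypothetical exception list: variants kept apart from their root because
# the root has another sense that the variant does not share.
# "gravitation" is the case the message says was not reduced; the message
# leaves the decision on "accelerators" open, so it is not listed here.
DO_NOT_CONFLATE = {"gravitation"}


def stem(word: str) -> str:
    """Return the root for a word unless the reduction is blocked."""
    lower = word.lower()
    if lower in DO_NOT_CONFLATE:
        return lower
    return CANDIDATE_ROOTS.get(lower, lower)


print(stem("gravitation"))   # gravitation
print(stem("accelerators"))  # accelerator
print(stem("engineers"))     # engineer
```

Whether a form like "accelerators" belongs on the exception list is exactly the kind of decision that corpus evidence about sense distribution can inform.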
I'm still looking for more examples - please keep them coming and
I'll post a second summary...
Sebastian Hoffmann
University of Zurich