<!doctype html public "-//W3C//DTD W3 HTML//EN">
<html><head><style type="text/css"><!--
blockquote, dl, ul, ol, li { padding-top: 0 ; padding-bottom: 0 }
--></style><title>summary: corpus evidence that runs counter to
intuition</title></head><body>
<div>Dear corpus list subscribers<br>
</div>
<div>I'd like to thank all those people who have sent replies to my
question about corpus evidence that runs counter to intuition. A
number of people have asked me to write a summary - so here's what I
received (in chronological order):</div>
<div><br></div>
<div><b>Yorick Wilks</b> mentioned that Paul Jacobs (in his
information extraction presentation) used to point out that, in
English, 'television' nearly always means 'the medium' in corpus
counts and not 'the TV set' which is what most of us would
(unreflectingly) think.</div>
<div><br></div>
<div><b>Eric Atwell</b> wrote that many people (probably especially
non-linguists) are surprised by the high frequency of the
four-letter word <i>fuck</i> in the spoken part of the BNC. He thinks
that this is "evidence that it is far more natural and normal in
ordinary speech than many expect".</div>
<div><br></div>
<div>I had a look in the BNC and here's some data:</div>
<div>In its various forms, <i>fuck</i> occurs 2,814 times in the
spoken part of BNC World Edition (272 instances per million words,
pmw). It is uttered almost three times as often by male speakers as by
female speakers (359pmw vs. 137pmw). So yes, it IS frequent. ;-) But it
may also be important to note that 985 of the spoken instances occur in
one single file (KDA), which is a collection of conversations between
aircraft engineers. In this file, the frequency is a whopping 13,000
instances pmw! Other files with high numbers of
(fuck|fucks|fucking|fucked) are
KD9 (110 instances, 7909pmw), KE1 (141 instances, 6713pmw), KDN (231
instances, 4986pmw), KCU (164 instances, 3045pmw), KP4 (105 instances,
2547pmw), G01 (104 instances, 2548pmw), and FP6 (100 instances,
2520pmw). Thus, these 8 files alone cover 1,940 - or 69 per cent - of
all relevant instances.</div>
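<div><br></div>
<div>The arithmetic behind these figures can be re-checked with a short
script. This is only a sketch: the counts are copied from the paragraph
above, while the variable names and the back-calculated corpus size are
assumptions of this illustration and not part of the original BNC
query.</div>
<pre>
```python
# Counts copied from the paragraph above: each BNC file ID is mapped to its
# number of instances of fuck/fucks/fucking/fucked.
file_counts = {
    "KDA": 985, "KD9": 110, "KE1": 141, "KDN": 231,
    "KCU": 164, "KP4": 105, "G01": 104, "FP6": 100,
}
total_spoken = 2814   # all instances in the spoken part
rate_pmw = 272        # reported frequency per million words (pmw)


def per_million(count, n_words):
    """Normalise a raw count to instances per million words."""
    return count / n_words * 1_000_000


top_files_total = sum(file_counts.values())
share = top_files_total / total_spoken

# Assumption: the size of the spoken part is not stated in the text, so it is
# back-calculated here from the two reported figures.
implied_words = total_spoken / rate_pmw * 1_000_000

print(top_files_total)                                   # 1940
print(round(share * 100))                                # 69
print(round(per_million(total_spoken, implied_words)))   # 272
```
</pre>
<div>Normalising to pmw is what makes the individual files comparable:
a file with few words and a moderate raw count can still show a very
high rate.</div>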
<div><br></div>
<div><b>Jasper Holmes</b> notes that intuition often fails in
connection with well-known 'grammar errors'. For example, around
half of the cases of NONE, NEITHER, NO+np (as subject of a present
tense verb) in the ICE-GB corpus appeared with plural verbal agreement
(your grammar book will tell you they are singular). The same holds for
examples like <i>a bag of letters</i>.</div>
<div><br></div>
<div><b>John McKenny</b> wrote:</div>
<div>What springs to mind for me is the use of 'would' to talk about
past habits. e.g. "when I was young I would go to Mass each
morning with my grandmother". Before the advent of COBUILD this
was considered to be literary and less colloquial than "used
to". This was my intuition and the general intuition of the EFL
community, as witnessed by countless ELT textbooks and grammars. I taught
"used to" to countless pre-intermediate students leaving
"would" for advanced students. COBUILD turned this
upside down, I think.</div>
<div><br></div>
<div><b>Philip Resnik</b> pointed out that "Talke Macfarland has
done some very interesting corpus-based work on passive cognate object
constructions, showing that corpus evidence contradicts some
introspection-based claims in the literature about
grammaticality."</div>
<div><br></div>
<div><b>John Williams</b> wrote that "In the Bank of English, by
far the most frequent meaning of 'bash' (any
part of speech) is 'party' whereas I think most native speakers
would intuitively go for 'hit, beat up' (informal). This could be
explained by the large news media component of the BofE ('bash =
party' is very much a 'media' word) or maybe it's 'really' the most
frequent meaning (whatever that means)."</div>
<div><br></div>
<div><b>Guy Aston</b> replied to this statement and pointed out that
the BNC data do not support this. He writes:</div>
<div>On a rough count, out of 272 occurrences of "bash", 97
are verbs meaning "hit" and 19 are forms of the
delexicalised "have a bash", as are 6 other nominal
uses meaning "a hit". 59 are proper nouns (characters called
"Bash"), leaving only 80ish as nouns meaning
"party". And then the verb bash also has other forms
...</div>
<div><br></div>
<div><b>John Williams</b> mentioned two further points:</div>
<div>The large news component [in the Bank of English] also explains
things like the main verb collocates of 'radio station' being things
like 'capture' or 'take over', rather than the more intuitive 'listen
to' or 'tune into'.</div>
<div>And also there are the well-known cases like 'give', where the
delexicalized meanings ('give a smile', etc) are more frequent than
'hand over, present'; and 'see = understand' which is more frequent
than 'see = perceive with eyes'.</div>
<div><br></div>
<div>Again, I checked the BNC and looked for verb collocates of
<i>radio station</i> and <i>radio stations</i> (which together occur 509
times in the BNC) within a window of -3 to +3. The result is ranked by
log-likelihood value and the lemmatization is based on the Lancaster
scheme provided with the BNC World Edition. Only node-collocate pairs
which occurred at least 3 times were considered for the calculation.
Sorry for the formatting - hope you can make sense of this
table.</div>
<div><br></div>
<div><br></div>
<div><pre>
---------------------------------------------------------------------
No.  Lemma                  n  n coll.  n texts  log-likelihood value
---------------------------------------------------------------------
  1  be_VERB          3244400       64       43  93.790871
  2  broadcast_VERB       970        7        6  89.244032
  3  occupy_VERB         4379        6        6  56.542749
  4  have_VERB        1319155       30       23  49.442746
  5  own_VERB            6372        5        5  41.556727
  6  play_VERB          37632        6        5  31.023832
  7  report_VERB        18747        5        5  30.875074
  8  run_VERB           39201        6        6  30.547817
  9  use_VERB          105881        8        5  29.948009
 10  seize_VERB          2505        3        3  27.448364
 11  say_VERB          318281       11       10  25.577838
 12  establish_VERB     17397        4        4  23.526801
 13  was_VERB          883602       16       14  20.391751
 14  operate_VERB       10179        3        3  19.103241
 15  hear_VERB          34747        3        2  11.959554
 16  take_VERB         173956        5        5  10.003759
 17  call_VERB          52265        3        3   9.669488
 18  get_VERB          213722        5        5   8.305941
 19  could_VERB        160161        4        4   7.064107
 20  know_VERB         178522        4        4   6.362790
 21  give_VERB         125302        3        3   5.087929
 22  go_VERB           227069        4        4   4.879994
 23  do_VERB           538558        6        6   3.632087
 24  will_VERB         329392        4        4   2.835328
---------------------------------------------------------------------
</pre></div>
<div>Since my calculation is based on single word collocates, I cannot
give any information about "the more intuitive [verb -
preposition combinations] 'listen to' or 'tune into'" - but in
any case, the verbs <i>listen</i> and <i>tune</i> are not found in the
above table... ;-)</div>
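<div><br></div>
<div>For readers who want to experiment with the ranking statistic, a
log-likelihood score for a single node-collocate pair can be sketched
as follows. The sketch assumes the common 2x2 contingency-table
formulation of the log-likelihood ratio (G2). The function name, the
argument layout and the toy counts are assumptions of this
illustration. The exact settings behind the table above (window
handling, corpus size) are not stated, so the sketch is not expected to
reproduce those values.</div>
<pre>
```python
import math


def log_likelihood(k11, k12, k21, k22):
    """G2 for a 2x2 table of observed counts.

    k11: collocate inside the window around the node
    k12: other words inside the window
    k21: collocate elsewhere in the corpus
    k22: other words elsewhere in the corpus
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o:  # an empty cell contributes nothing to the sum
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2 * g2


# Toy counts (hypothetical, not taken from the BNC).
print(log_likelihood(10, 10, 10, 10))            # observed equals expected: 0.0
print(log_likelihood(7, 3000, 963, 99_996_030))  # rare collocate, strong association
```
</pre>
<div>A score of this kind rewards pairs that co-occur more often than
their overall frequencies would predict, which is one way a rare verb
can outrank a very frequent one with many more raw co-occurrences.</div>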
<div><br></div>
<div><b>Bob Krovetz</b> wrote:</div>
<div>In my work on morphology I would sometimes come across examples
that made sense, but I wouldn't have thought of them beforehand. I
studied corpus data in order to decide which morphological variants I
should reduce to a root for purposes of information retrieval (this is
called "stemming"). I tried to avoid any groupings
that would create ambiguity. For example, I didn't reduce
"gravitation" to "gravity" because "gravity"
can also mean "serious" (the gravity of the crime), which is
the predominant meaning of "gravity" in legal text. So
should "accelerators" be reduced to
"accelerator"? I found that "accelerator"
refers to either a car accelerator or a nuclear particle
accelerator in newspaper text. But "accelerators"
referred only to nuclear particle accelerators. We just don't
talk about more than one car accelerator. It's possible to do
so, but very unlikely (at least within newspaper text).
I'm not saying that "accelerator" is limited to those two
meanings either - those were just the ones I found in the corpora I
studied.</div>
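<div><br></div>
<div>The kind of decision described here can be pictured as a stemmer
with a corpus-derived exception list. The sketch below is purely
illustrative: the plural rule, the PROTECTED set and the function name
are hypothetical, it assumes one decides to keep "accelerators"
separate, and it is not Bob Krovetz's actual stemmer.</div>
<pre>
```python
# Hypothetical exception list: forms whose reduction would merge distinct
# senses, following the two examples given in the text above.
PROTECTED = {"gravitation", "accelerators"}


def conservative_stem(word):
    """Strip a plural -s unless the form is protected or ends in -ss."""
    word = word.lower()
    if word in PROTECTED:
        return word  # keep the variant separate to avoid ambiguity
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


for form in ("engines", "accelerators", "gravitation", "glass"):
    print(form, conservative_stem(form))
```
</pre>
<div>The exception list is the part that corpus study supplies: the
rule itself stays simple, and the evidence decides which forms it may
touch.</div>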
<div><br></div>
<div>I'm still looking for more examples - please keep them coming and
I'll post a second summary...</div>
<div><br></div>
<div>Sebastian Hoffmann</div>
<div>University of Zurich</div>
</body>
</html>