[Corpora-List] Surprisingly large MI scores

David Beavan d.beavan at englang.arts.gla.ac.uk
Tue Sep 8 08:49:20 UTC 2009


I agree with Mark's figures. Using the whole of the BNC, my clouds give
wreak and havoc an MI of 14.00. But that's not an n-gram, merely
collocates within a span of 5 from wreak:

http://www.scottishcorpus.ac.uk/corpus/bnc/collocatecloud.php?word=wreak
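
To make the span-based figure above concrete, here is a minimal Python
sketch of MI for collocates found within a window around a node word.
It assumes the plain formula log2(O * N / (f_node * f_collocate)) with no
correction for window size. The function name and the toy sentence are
made up for this example, and this is not claimed to be how the
collocate clouds are actually computed.

```python
import math
from collections import Counter

def span_collocate_mi(tokens, node, span=5):
    """MI (in bits) for every word seen within `span` tokens of `node`.

    Assumed formula: log2(O * N / (f_node * f_collocate)), where O is the
    number of co-occurrences inside the window and N is the corpus size.
    """
    n = len(tokens)
    freq = Counter(tokens)
    cooc = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Look `span` tokens to the left and right of this node occurrence.
        for j in range(max(0, i - span), min(n, i + span + 1)):
            if j != i:
                cooc[tokens[j]] += 1
    return {
        word: math.log2(count * n / (freq[node] * freq[word]))
        for word, count in cooc.items()
    }

# Toy data, purely illustrative.
toy = "storms wreak havoc and floods wreak havoc too".split()
scores = span_collocate_mi(toy, "wreak", span=1)
```

Because the window counts co-occurrences on both sides of the node and
not only adjacent pairs, a span-based score need not equal the score
for the bigram "wreak havoc".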

Dave

-- 
David Beavan
Computing Manager
Scottish Corpus of Texts & Speech
Corpus of Modern Scottish Writing

University of Glasgow, 6 University Gardens, Glasgow G12 8QQ
+44 (0)141 330 2382
http://www.scottishcorpus.ac.uk/
The University of Glasgow, charity number SC004401

-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf
Of Mark Davies
Sent: 07 September 2009 23:53
To: Brett Reynolds
Subject: Re: [Corpora-List] Surprisingly large MI scores

Brett,

I get quite different scores for the Academic-only queries, and these
are much more in line with what one would expect. 

BNC (corpus.byu.edu/bnc)
the content of 2.75 (vs. your 0.34)
is one of the 2.18 (vs. your -0.24)
a kind of 3.52
the extent to which 2.18
in other words 4.39
a great deal of 2.94

Corpus of Contemporary American English (www.americancorpus.org)
the content of 2.94
is one of the 2.35
a kind of 3.69
the extent to which 2.35
in other words 4.45
a great deal of 3.11

Also, the MI scores from the BYU-BNC agree quite nicely with the MI from
the BNC via Sketch Engine and BNCweb. For example, for [ * havoc ],
BYU-BNC gives 16.9 for [wreak], Sketch Engine gives 17.0, and BNCweb
gives 17.1. So apparently they are all using the same MI formula
correctly. (BTW, the calculated corpus size might account for the very
small differences, since the number of "words" in the BNC differs
slightly depending on what counts as a "word").
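
As a rough sketch of the corpus-size point, here is the usual pairwise
MI formula in Python. The counts and the two corpus sizes are
hypothetical, chosen only for illustration; they are not real BNC
figures, and the function name is made up for this example.

```python
import math

def mi_score(pair_freq, freq_x, freq_y, corpus_size):
    """Pointwise MI in bits: log2(observed / expected)."""
    # Expected co-occurrences if the two words were independent.
    expected = freq_x * freq_y / corpus_size
    return math.log2(pair_freq / expected)

# Hypothetical counts (not taken from any real corpus).
PAIR, WORD_X, WORD_Y = 50, 100, 400

# Same counts, two different ideas of how many "words" the corpus has.
mi_small = mi_score(PAIR, WORD_X, WORD_Y, 98_000_000)
mi_large = mi_score(PAIR, WORD_X, WORD_Y, 112_000_000)
shift = mi_large - mi_small
```

With the counts held fixed, the shift depends only on the ratio of the
two corpus sizes: log2(112/98), or about 0.19 bits here. That is the
same order as the 0.1 to 0.2 differences between the three interfaces.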

As you've mentioned, these MI scores are much, much lower than what
Ellis et al. have found. Even with a very highly idiomatic phrase like
"run amok" or "wreak havoc", MI scores are almost never above 16-17 --
certainly not up in the 60-80 range.
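
One possible source of scores in that range is a formula that extends
MI to n words by dividing the observed n-gram frequency by the
frequency expected if all n words were independent. The Python sketch
below assumes that generalization. Whether Collocate or any of the
interfaces above uses it is an assumption, not something stated in
this thread, and all counts are hypothetical.

```python
import math

def ngram_mi(ngram_freq, word_freqs, corpus_size):
    """MI in bits for an n-gram under full independence of its words.

    Assumed formula: log2(O / E), with E = N * product(f_i / N).
    """
    expected = corpus_size
    for f in word_freqs:
        expected *= f / corpus_size
    return math.log2(ngram_freq / expected)

# Hypothetical: a 4-million-word corpus, each word occurring 40,000
# times, and every n-gram observed exactly 12 times.
scores = {
    n: ngram_mi(12, [40_000] * n, 4_000_000)
    for n in range(2, 6)
}
```

With these numbers each extra word divides the expected frequency by
100, so the score rises by log2(100), about 6.64 bits, per added word
even though the observed frequency never changes. A formula of this
kind therefore yields larger scores for longer n-grams.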

Feel free to email me if you need help with these.

Mark

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906

http://davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================ 

> -----Original Message-----
> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf
> Of Brett Reynolds
> Sent: Saturday, September 05, 2009 6:49 AM
> To: Corpora List
> Subject: [Corpora-List] Surprisingly large MI scores
> 
> "Formulaic language in native speakers: Triangulating 
> psycholinguistics, corpus linguistics, and education" by Nick C. Ellis
> and Rita Simpson-Vlach was recently published in _Corpus Linguistics 
> and Linguistic Theory_
> <http://www.reference-global.com/doi/abs/10.1515/CLLT.2009.003>.
> 
> Therein is a table of n-grams in three columns: low, medium, and high 
> MI scores. I'm going from memory, but the authors consider roughly
> MI=3 as low, MI=6 as medium, and MI=12 as high.
> 
> I have only a very rudimentary understanding of MI scores, but my 
> understanding is that an MI of 3 indicates a strong collocation, so I 
> wrote to Nick and asked him about it. He wrote back that:
> 
> -They used Mike Barlow's Collocate.
> -"MI is very sensitive to length of n-gram.
> 
> For example, calculating MI for all 2-9 grams in the whole of BNCBaby 
> occurring 12+ times, for each N we found the median MI, resulting in:
> N Median MI
> 2 2.234
> 3 6.723
> 4 13.085
> 5 20.835
> 6 38.925
> 7 53.612
> 8 69.046
> 9 79.962"
> 
> I've never seen MI scores of that size. Moreover, when I looked at 
> some of the n-grams that appear in the paper using Mark Davies' COCA 
> and BNC interfaces, I came up with much lower numbers. Here are some
> examples: the first is the MI in the entire corpus, and the second is 
> the MI in the academic subcorpus.
> 
> BNC
> the content of 2.99 0.34
> is one of the 2.41 -0.24
> a kind of 4.06 1.41
> the extent to which 2.41 -0.24
> in other words 4.71 2.05
> a great deal of 3.47 0.82
> 
> COCA
> the content of 3.24 0.90
> is one of the 2.66 0.31
> a kind of 4.31 1.97
> the extent to which 2.66 0.31
> in other words 4.83 2.49
> a great deal of 3.73 1.38
> 
> Again, the numbers in the paper are often four times those above. Can 
> anybody help me understand this discrepancy?
> 
> Best,
> Brett
> 
> <http://english-jack.blogspot.com>
> 
> -----------------------
> Brett Reynolds
> English Language Centre
> Humber College Institute of Technology and Advanced Learning Toronto, 
> Ontario, Canada brett.reynolds at humber.ca
> 
> 
> 
> 
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
