[Corpora-List] Surprisingly large MI scores

Mark Davies Mark_Davies at byu.edu
Mon Sep 7 22:53:27 UTC 2009


Brett,

I get quite different scores for the Academic-only queries, and these are much more in line with what one would expect. 

BNC (corpus.byu.edu/bnc)
the content of 2.75 (vs. your 0.34)
is one of the 2.18 (vs. your -0.24)
a kind of 3.52
the extent to which 2.18
in other words 4.39
a great deal of 2.94

Corpus of Contemporary American English (www.americancorpus.org)
the content of 2.94
is one of the 2.35
a kind of 3.69
the extent to which 2.35
in other words 4.45
a great deal of 3.11

Also, the MI scores from the BYU-BNC agree quite nicely with the MI scores for the BNC from Sketch Engine and BNCweb. For example, for [ * havoc ], BYU-BNC gives 16.9 for [wreak], Sketch Engine gives 17.0, and BNCweb gives 17.1. So apparently all three are applying the same MI formula. (BTW, the calculated corpus size might account for the very small differences, since the number of "words" in the BNC differs slightly depending on what counts as a "word".)
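
To illustrate the corpus-size point, here is a minimal sketch, assuming the usual pointwise MI definition (log base 2 of observed over expected co-occurrence) and ignoring any span or window adjustment an interface might apply. The function name and all counts are hypothetical and are not taken from the BNC or any other corpus.

```python
import math

def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size):
    """Pointwise MI: log2 of observed co-occurrence over expected co-occurrence."""
    # Expected co-occurrence if the two words were independent.
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(pair_freq / expected)

# Hypothetical counts, for illustration only.
pair, node, coll = 50, 200, 400

# Same counts, two slightly different ideas of how many "words" the corpus has.
mi_small = mutual_information(pair, node, coll, 98_000_000)
mi_large = mutual_information(pair, node, coll, 100_000_000)

print(round(mi_small, 2), round(mi_large, 2), round(mi_large - mi_small, 3))
```

With everything else held constant, a 2% difference in the assumed corpus size shifts MI by only log2(100/98), a few hundredths of a point, which is the order of magnitude of the differences between the three interfaces.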

As you've mentioned, these MI scores are much, much lower than what Ellis et al. have found. Even for a highly idiomatic phrase like "run amok" or "wreak havoc", MI scores are almost never above 16-17 -- certainly not up in the 60-80 range.
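
On how a score could climb with n-gram length: a minimal sketch, assuming MI for an n-gram is computed as log2 of the observed frequency over the frequency expected if all n words were independent. Whether Collocate, or any of the interfaces above, uses exactly this generalization is an assumption and is not confirmed in this thread. The function name and all counts are hypothetical.

```python
import math

def ngram_mi(ngram_freq, word_freqs, corpus_size):
    """log2(observed / expected), with expected computed under full independence."""
    n = len(word_freqs)
    # expected = corpus_size * product(f_i / corpus_size)
    #          = product(f_i) / corpus_size ** (n - 1), handled in log space.
    log_expected = (sum(math.log2(f) for f in word_freqs)
                    - (n - 1) * math.log2(corpus_size))
    return math.log2(ngram_freq) - log_expected

# Hypothetical setup: every word occurs 1,000,000 times in a
# 100,000,000-word corpus, and each n-gram occurs 12 times.
scores = {n: ngram_mi(12, [1_000_000] * n, 100_000_000) for n in range(2, 10)}

for n, score in scores.items():
    print(n, round(score, 2))
```

Under this assumed formula, each additional word adds log2(corpus_size / word_freq) to the score when the n-gram frequency is held fixed, so the values rise roughly linearly with n. That would be one way to end up with very large scores for long n-grams, but it is only a guess at what is behind the published figures.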

Feel free to email me if you need help with these.

Mark

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906

http://davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================ 

> -----Original Message-----
> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Brett Reynolds
> Sent: Saturday, September 05, 2009 6:49 AM
> To: Corpora List
> Subject: [Corpora-List] Surprisingly large MI scores
> 
> "Formulaic language in native speakers: Triangulating
> psycholinguistics, corpus linguistics, and education" by Nick C. Ellis
> and Rita Simpson-Vlach was recently published in _Corpus Linguistics
> and Linguistic Theory_
> <http://www.reference-global.com/doi/abs/10.1515/CLLT.2009.003>.
> 
> Therein is a table of n-grams in three columns: low, medium, and high
> MI scores. I'm going from memory, but the authors consider roughly
> MI=3 as low, MI=6 as medium, and MI=12 as high.
> 
> I have only a very rudimentary understanding of MI scores, but my
> understanding is that an MI of 3 indicates a strong collocation, so I
> wrote to Nick and asked him about it. He wrote back that:
> 
> -They used Mike Barlow's Collocate.
> -"MI is very sensitive to length of n-gram.
> 
> For example, calculating MI for all 2-9 grams in the whole of BNCBaby
> occurring 12+ times, for each N we found the median MI, resulting in:
> N Median MI
> 2 2.234
> 3 6.723
> 4 13.085
> 5 20.835
> 6 38.925
> 7 53.612
> 8 69.046
> 9 79.962"
> 
> I've never seen MI scores of that size. Moreover, when I looked at
> some of the n-grams that appear in the paper using Mark Davies' COCA
> and BNC interfaces, I came up with much lower numbers. Here are some
> examples: the first is the MI in the entire corpus, and the second is
> the MI in the academic subcorpus.
> 
> BNC
> the content of  2.99 0.34
> is one of the 2.41 -0.24
> a kind of 4.06 1.41
> the extent to which 2.41 -0.24
> in other words 4.71 2.05
> a great deal of 3.47 0.82
> 
> COCA
> the content of 3.24 0.90
> is one of the 2.66 0.31
> a kind of 4.31 1.97
> the extent to which 2.66 0.31
> in other words 4.83 2.49
> a great deal of 3.73 1.38
> 
> Again, the numbers in the paper are often four times those above. Can
> anybody help me understand this discrepancy?
> 
> Best,
> Brett
> 
> <http://english-jack.blogspot.com>
> 
> -----------------------
> Brett Reynolds
> English Language Centre
> Humber College Institute of Technology and Advanced Learning
> Toronto, Ontario, Canada
> brett.reynolds at humber.ca
> 
> 
> 
> 
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
