[Corpora-List] Surprisingly large MI scores

Nick Ellis ncellis at umich.edu
Thu Sep 10 18:01:09 UTC 2009


1) MI2 or MI10?



The MI for a 2-gram (a b) is calculated as

             log2(p(a,b) / (p(a) * p(b)))



The original application of MI to collocation by Church and Hanks (1990) used log2; see also Oakes' Statistics for Corpus Linguistics book.

AntConc calculates MI in this way. Laurence Anthony, in the AntConc readme file, says that for MI he uses "equations described in M. Stubbs, Collocations and Semantic Profiles, Functions of Language 2, 1 (1995)". In that article, Mike Stubbs refers to Church and Hanks 1990. See also http://ell.phil.tu-chemnitz.de/analysis/collocations.html.

We believe that Collocate, the package we used in the Ellis & Simpson-Vlach research, uses log2 too. Although the Collocate manual does not describe the formula it uses, it generates the same MI values for bigrams as AntConc (give or take a bit, probably depending on definitions of what is a word, etc.).



The extension to a 3-gram (a b c) is

             log2(p(a,b,c) / (p(a) * p(b) * p(c)))
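The formulas above can be sketched in a few lines of Python. This is an illustrative sketch only: the counts in the example are invented, and real tools differ slightly in how they estimate the probabilities (tokenization, what counts as a "word", etc.).

```python
import math

def mi(ngram_count, word_counts, corpus_size, base=2):
    """Mutual information for an n-gram, following Church & Hanks (1990):
    log(p(ngram) / product of the independent word probabilities).

    ngram_count: corpus frequency of the whole n-gram
    word_counts: list of corpus frequencies of its component words
    """
    p_ngram = ngram_count / corpus_size
    p_indep = 1.0
    for count in word_counts:
        p_indep *= count / corpus_size
    return math.log(p_ngram / p_indep, base)

# Bigram with invented counts in a 1M-word corpus: each word occurs
# 1,000 times, and the pair co-occurs 100 times.
mi(100, [1000, 1000], 1_000_000)           # log2 scale, ≈ 6.64
mi(100, [1000, 1000], 1_000_000, base=10)  # same association in log10, = 2.0
```

The same function handles 3-grams and longer by passing more word counts, matching the extension shown above.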

In our (Ellis & Simpson-Vlach, 2009) written academic corpus (Hyland's (2004) research article corpus, 1.2 million words, plus selected BNC files, 931,000 words), our analyses generated the following MIs for 6 example n-grams, shown in the first column of data. Mark Davies' values from the BNC (from his latest e-mail to the list on this issue) are shown in the second column.

n-gram                 our corpus MIs   Davies BNC MIs   ours using log10

the content of                   5.28             2.75             1.59
is one of the                    7.72             2.18             2.32
a kind of                        7.02             3.52             2.11
the extent to which             14.81             2.18             4.46
in other words                  12.01             4.39             3.61
a great deal of                 20.39             2.94             6.14



We believe that Mark is using log10 in his calculations. If we do the same, we get the MI values shown in the final column. Allowing for the different samples, we're in the same ballpark.



Mark, can you confirm that your interfaces (corpus.byu.edu/bnc/) produce MI calculated with log10?



If some of us are using log2 and others log10, there's no problem of comparability within a study; across studies we need simply apply the scaling factor log2(10) ≈ 3.3219. But there is scope for error if we are not clear about our units (remember the Mars Climate Orbiter).
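The conversion between the two conventions is just the constant factor mentioned above. A minimal sketch (the function names are ours, not from any of the tools discussed):

```python
import math

LOG10_TO_LOG2 = math.log2(10)  # the scaling factor, ≈ 3.3219

def mi2_from_mi10(mi10):
    """Convert an MI score computed with log10 to its log2 equivalent."""
    return mi10 * LOG10_TO_LOG2

def mi10_from_mi2(mi2):
    """Convert an MI score computed with log2 to its log10 equivalent."""
    return mi2 / LOG10_TO_LOG2

# "a great deal of" from the table above: MI2 = 20.39
round(mi10_from_mi2(20.39), 2)  # -> 6.14, the log10 column
```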

We should be explicit in our reports. Perhaps there is reason to standardly report MI as either MI2 or MI10.



2) MI is sensitive to n-gram length



MI is sensitive to the length of the string: longer formulaic sequences are rarer (see Newell, 1990, on this). As I said in an earlier reply to Brett Reynolds, we (Ellis, O'Donnell, Römer, Gries, & Wulff, 2009) calculated MI for all 2-9 grams occurring 12+ times in the whole of BNCBaby; for each N we found the median MI, resulting in:

N    Median MI

2       2.234
3       6.723
4      13.085
5      20.835
6      38.925
7      53.612
8      69.046
9      79.962



We did this by formula (not using Collocate), but again with log2. The slides from this talk can be found on our Michigan Corpus Linguistics site:

http://ctr.elicorpora.info/formulaic-language-project

Matt O’Donnell has now repeated these analyses, just to be sure, and  
obtained similar results:

N            MEDIAN

2            2.26475918

3            6.783017633

4            13.17090969

5            20.95527274

6            39.06738859

7            53.78959166

8            69.24835305

9            80.19228017



Word and 2-9 gram lists were generated in WordSmith with a frequency threshold of 12+ (i.e. 3 per million), then a Python script calculated MI using the log2 formula.
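The core of such a script can be sketched as follows. This is our reconstruction of the procedure described above, not the actual script: given n-gram and word frequency counts, it applies the frequency threshold and returns the median log2 MI per n-gram length.

```python
import math
from statistics import median

def median_mi_by_length(ngram_counts, word_counts, corpus_size, min_freq=12):
    """Median log2 MI per n-gram length, over n-grams at or above a
    frequency threshold (12+ here, i.e. 3 per million in BNCBaby).

    ngram_counts: dict mapping n-gram (tuple of words) -> frequency
    word_counts:  dict mapping word -> frequency
    """
    by_length = {}
    for ngram, count in ngram_counts.items():
        if count < min_freq:
            continue
        p_ngram = count / corpus_size
        p_indep = math.prod(word_counts[w] / corpus_size for w in ngram)
        by_length.setdefault(len(ngram), []).append(math.log2(p_ngram / p_indep))
    return {n: median(scores) for n, scores in sorted(by_length.items())}
```

In practice the frequency lists would be read from the WordSmith output files; the dictionaries here stand in for that step.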





The marginal mean MIs (low = 3.3, medium = 6.7, high = 11) in Ellis & Simpson-Vlach Table 1 average over strings of length n = 3, 4, 5, so they are greater than one might expect for bigrams. We stratified within each length; we did not use these values as overall thresholds.

Thanks to Matt O’Donnell and Ute Römer for working this through with  
me.

Hope it clarifies,



             Nick Ellis



Ellis, N. C., O'Donnell, M. B., Römer, U., Gries, S. T., & Wulff, S. (2009). Measuring the formulaicity of language. Paper presented at AAAL 2009, the annual conference of the American Association for Applied Linguistics, Denver, CO, March 21-24, 2009.

Ellis, N. C. & Simpson-Vlach, R. (2009). Formulaic language in native  
speakers: Triangulating psycholinguistics, corpus linguistics, and  
education. Corpus Linguistics and Linguistic Theory, 5, 61-78.

Newell, A. (1990). Unified theories of cognition. Cambridge, MA:  
Harvard University Press.



Nick Ellis

   Professor of Psychology
   Research Scientist, English Language Institute
University of Michigan
Room 1011,
500 East Washington Street
Ann Arbor
MI 48104
USA

e-mail: 		 		ncellis at umich.edu
home page:			Ellis
work  phone:      		734-647-0454
work fax :   			734-763-0369



On Sep 5, 2009, at 8:48 AM, Brett Reynolds wrote:

> "Formulaic language in native speakers: Triangulating  
> psycholinguistics, corpus linguistics, and education" by Nick C.  
> Ellis and Rita Simpson-Vlach was recently published in _Corpus  
> Linguistics and Linguistic Theory_ <http://www.reference-global.com/doi/abs/10.1515/CLLT.2009.003 
> >.
>
> Therein is a table of n-grams in three columns: low, medium, and  
> high MI scores. I'm going from memory, but the authors consider  
> roughly MI=3 as low, MI=6 as medium, and MI=12 as high.
>
> I have only a very rudimentary understanding of MI scores, but my  
> understanding is that an MI of 3 indicates a strong collocation, so  
> I wrote to Nick and asked him about it. He wrote back that:
>
> -They used Mike Barlow's Collocate.
> -"MI is very sensitive to length of n-gram.
>
> For example, calculating MI for all 2-9 grams in the whole of  
> BNCBaby occurring 12+ times, for each N we found the median MI,  
> resulting in:
> N Median MI
> 2 2.234
> 3 6.723
> 4 13.085
> 5 20.835
> 6 38.925
> 7 53.612
> 8 69.046
> 9 79.962"
>
> I've never seen MI scores of that size. Moreover, when I looked at  
> some of the n-grams that appear in the paper using Mark Davies' COCA  
> and BNC interfaces, I came up with much lower numbers. Here are some  
> examples: the first is the MI in the entire corpus, and the second  
> is the MI in the academic subcorpus.
>
> BNC
> the content of  2.99 0.34
> is one of the 2.41 -0.24
> a kind of 4.06 1.41
> the extent to which 2.41 -0.24
> in other words 4.71 2.05
> a great deal of 3.47 0.82
>
> COCA
> the content of 3.24 0.90
> is one of the 2.66 0.31
> a kind of 4.31 1.97
> the extent to which 2.66 0.31
> in other words 4.83 2.49
> a great deal of 3.73 1.38
>
> Again, the numbers in the paper are often four times those above.  
> Can anybody help me understand this discrepancy?
>
> Best,
> Brett
>
> <http://english-jack.blogspot.com>
>
> -----------------------
> Brett Reynolds
> English Language Centre
> Humber College Institute of Technology and Advanced Learning
> Toronto, Ontario, Canada
> brett.reynolds at humber.ca
>
>
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>
Brett,

I get quite different scores for the Academic-only queries, and these  
are much more in line with what one would expect.

BNC (corpus.byu.edu/bnc)
the content of  2.75 (vs. your 0.34)
is one of the 2.18 (vs. your -0.24)
a kind of 3.52
the extent to which 2.18
in other words 4.39
a great deal of 2.94

Corpus of Contemporary American English (www.americancorpus.org)
the content of 2.94
is one of the 2.35
a kind of 3.69
the extent to which 2.35
in other words 4.45
a great deal of 3.11

Also, the MI scores from the BYU-BNC agree quite nicely with the MI  
from the BNC via Sketch Engine and BNCweb. For example, for [ *  
havoc ], BYU-BNC gives 16.9 for [wreak], Sketch Engine gives 17.0, and  
BNCweb gives 17.1. So apparently they are all using the same MI  
formula correctly. (BTW, the calculated corpus size might account for  
the very small differences, since the number of "words" in the BNC  
differs slightly depending on what counts as a "word").

As you've mentioned, these MI scores are much, much lower than what  
Ellis et al have found. Even with a very highly idiomatic phrase like  
"run amok" or "wreak havoc", MI scores are almost never above 16-17 --  
certainly not up in the 60-80 range.

Feel free to email me if you need help with these.

Mark

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906

