[Corpora-List] Surprisingly large MI scores

Alexander Clark alexsclark at googlemail.com
Tue Sep 8 06:27:36 UTC 2009


Two points:

1. Wouldn't one expect pointwise MI to increase linearly with the length
of the n-gram for grammatical subsequences, assuming the obvious
generalisation from 2-grams to n-grams?

2. These are not the actual mutual information values but estimates of
them, probably computed from unsmoothed counts.
Taking the ML (maximum likelihood) estimate of the probabilities and then
computing the MI from it is a bad way of estimating the MI.
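
To make both points concrete, here is a minimal sketch in Python. It
assumes the "obvious generalisation" is
PMI(w1..wn) = log2 [ P(w1..wn) / (P(w1) * ... * P(wn)) ]. By the chain
rule this is a sum of n-1 terms log2 [ P(wi | w1..wi-1) / P(wi) ], so if
each word is more likely in context than in isolation, every term is
positive and the total grows with n. The toy corpus, the function names
and the add-alpha smoothing are illustrative assumptions only; they are
not taken from the paper or from Collocate, and add-alpha is just one
crude alternative to the ML estimate.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pmi(ngram, tokens, alpha=0.0):
    """Pointwise MI (in bits) of an n-gram against an independence model.

    alpha = 0 gives the unsmoothed ML estimate. alpha > 0 applies add-alpha
    smoothing to the n-gram probability only (an illustrative choice).
    With alpha = 0 the n-gram must occur in `tokens`, otherwise log2(0)
    raises ValueError; every word of the n-gram must occur in `tokens`.
    """
    n = len(ngram)
    unigram_counts = Counter(tokens)
    ngram_counts = Counter(ngrams(tokens, n))
    total_unigrams = len(tokens)
    total_ngrams = len(tokens) - n + 1
    # Number of distinct n-grams possible over the observed vocabulary;
    # the smoother spreads alpha pseudo-counts over each of them.
    possible = len(unigram_counts) ** n
    p_joint = (ngram_counts[ngram] + alpha) / (total_ngrams + alpha * possible)
    # Probability of the same words under independence (product of unigrams).
    p_indep = math.prod(unigram_counts[w] / total_unigrams for w in ngram)
    return math.log2(p_joint / p_indep)


# Hypothetical toy corpus, far too small for real estimates.
tokens = ("in other words the cat sat on the mat "
          "in other words the dog sat on the rug").split()

phrase = ("in", "other", "words", "the")
for n in range(2, len(phrase) + 1):
    prefix = phrase[:n]
    print(n, " ".join(prefix),
          round(pmi(prefix, tokens), 2),              # unsmoothed ML estimate
          round(pmi(prefix, tokens, alpha=0.5), 2))   # add-0.5 smoothed
```

On this toy corpus the unsmoothed score rises with each added word, and
the smoothed score for the same n-gram is lower at every length, which is
the direction of the effect one would expect from unsmoothed counts.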



2009/9/5 Brett Reynolds <brett at forsyths.ca>:
> "Formulaic language in native speakers: Triangulating psycholinguistics,
> corpus linguistics, and education" by Nick C. Ellis and Rita Simpson-Vlach
> was recently published in _Corpus Linguistics and Linguistic Theory_
> <http://www.reference-global.com/doi/abs/10.1515/CLLT.2009.003>.
>
> Therein is a table of n-grams in three columns: low, medium, and high MI
> scores. I'm going from memory, but the authors consider roughly MI=3 as low,
> MI=6 as medium, and MI=12 as high.
>
> I have only a very rudimentary understanding of MI scores, but my
> understanding is that an MI of 3 indicates a strong collocation, so I wrote
> to Nick and asked him about it. He wrote back that:
>
> -They used Mike Barlow's Collocate.
> -"MI is very sensitive to length of n-gram.
>
> For example, calculating MI for all 2-9 grams in the whole of BNCBaby
> occurring 12+ times, for each N we found the median MI, resulting in:
> N   Median MI
> 2    2.234
> 3    6.723
> 4   13.085
> 5   20.835
> 6   38.925
> 7   53.612
> 8   69.046
> 9   79.962"
>
> I've never seen MI scores of that size. Moreover, when I looked at some of
> the n-grams that appear in the paper using Mark Davies' COCA and BNC
> interfaces, I came up with much lower numbers. Here are some examples: the
> first is the MI in the entire corpus, and the second is the MI in the
> academic subcorpus.
>
> BNC                    whole corpus   academic
> the content of                 2.99       0.34
> is one of the                  2.41      -0.24
> a kind of                      4.06       1.41
> the extent to which            2.41      -0.24
> in other words                 4.71       2.05
> a great deal of                3.47       0.82
>
> COCA                   whole corpus   academic
> the content of                 3.24       0.90
> is one of the                  2.66       0.31
> a kind of                      4.31       1.97
> the extent to which            2.66       0.31
> in other words                 4.83       2.49
> a great deal of                3.73       1.38
>
> Again, the numbers in the paper are often four times those above. Can
> anybody help me understand this discrepancy?
>
> Best,
> Brett
>
> <http://english-jack.blogspot.com>
>
> -----------------------
> Brett Reynolds
> English Language Centre
> Humber College Institute of Technology and Advanced Learning
> Toronto, Ontario, Canada
> brett.reynolds at humber.ca
>
>
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora



-- 
Alex Clark
