[Corpora-List] Surprisingly large MI scores

Alexander Clark alexsclark at googlemail.com
Tue Sep 29 06:21:08 UTC 2009


The problem is whether you use probabilities or counts.
If you use counts, then you need to divide by N -- p(x) is approximated
by f(x)/N, which gives you the factor of $N^{s-1}$:

p(x_1 \dots x_s) / (p(x_1) \dots p(x_s)) = N^{s-1} f(x_1 \dots x_s) /
(f(x_1) \dots f(x_s))

Span here is just the number of words -- the n in the n-gram, and not the gap.
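
As a rough check, here is a minimal Python sketch using the counts for
"purple" / "bright" quoted below. The function name mi_from_counts is just
an illustrative choice, and treating the window count f(purple, bright) = 9
as if it were a 2-gram count is an assumption made only for the arithmetic.

```python
import math

def mi_from_counts(joint, singles, n):
    """log2( N^(s-1) * f(x_1..x_s) / (f(x_1) * ... * f(x_s)) ), with s = len(singles)."""
    s = len(singles)
    log_num = (s - 1) * math.log2(n) + math.log2(joint)
    log_den = sum(math.log2(f) for f in singles)
    return log_num - log_den

# Counts taken from the quoted message (BNC, node word "purple").
N = 100_000_000
f_purple, f_bright, f_pair = 1262, 5277, 9

# s = 2: two words, so the factor is N^(2-1) = N.
mi_pair = mi_from_counts(f_pair, [f_purple, f_bright], N)

# Window-normalised variant from the quoted message (divide by S = 6).
mi_window = math.log2(N * f_pair / (f_purple * f_bright * 6))

# Misreading s as the window size (6) puts N^5 in the numerator.
mi_misread = math.log2(N ** 5 * f_pair / (f_purple * f_bright))

print(round(mi_pair, 2), round(mi_window, 2), round(mi_misread, 2))
```

With s = 2 this comes out at about 7.08, the window-normalised variant at
about 4.49, and the N^5 misreading at about 113.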

In any event this is using the unsmoothed counts, which will
systematically overestimate the MI.
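
One way to see the effect, sketched in Python with the same counts as in the
quoted message: with raw counts, a pair that is observed at all has a joint
count of at least 1, so its score cannot fall below log2(1 / E), where E is
the count expected if the two words were independent. This is only the plain
count formula again, not a smoothing method, and the function names are
illustrative.

```python
import math

def expected_joint(fx, fy, n):
    """Co-occurrence count expected under independence: f(x) * f(y) / N."""
    return fx * fy / n

def mi(joint, fx, fy, n):
    """Unsmoothed MI for a pair: log2(observed / expected)."""
    return math.log2(joint / expected_joint(fx, fy, n))

N = 100_000_000
f_purple, f_bright = 1262, 5277

# Expected count is far below 1 for two fairly rare words.
e = expected_joint(f_purple, f_bright, N)

# Lowest score any *observed* pair with these frequencies can get (joint = 1).
floor = mi(1, f_purple, f_bright, N)

print(round(e, 4), round(floor, 2))
```

Here E is about 0.067, so a single chance co-occurrence already scores
about 3.9.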

Alex


2009/9/28 Mark Davies <Mark_Davies at byu.edu>:
> Michael B. gave the MI formula from COLLOCATES as:
>
> MI = log2 ( ( N^(s-1) * f (x y) ) / ( f (x) * f (y) ) )
>
> Brett R. gives:
>
> MI = log2 ( ( N * f (x y) ) / ( f (x) * f (y) ) )       (where is the span?)
>
> I use (http://corpus.byu.edu):
>
> MI = log10 ( ( N * f (x,y) ) / ( f (x) * f (y) * S ) ) / log(2)
> (divide by log(2), since LOG in SQL Server is base 10)
>
>
>
> where N = corpus size and S = span size.
>
>
>
> This is apparently the same or quite similar to what is used for BNCweb. The
> following are the MI scores from BNCweb and BYU-BNC
> (http://corpus.byu.edu/bnc) for collocates of "purple" (span = 3L / 3R):
>
>
>
> collocate   BNCweb   BYU-BNC
> ---------   ------   -------
> patch       7.65     7.29
> scarlet     6.16     6.07
> emperor     5.75     5.40
> bright      4.43     4.44
>
>
>
> Strangely enough, Sketch Engine gives scores (for same corpus (BNC), node
> word (purple), span (3L, 3R), and collocates) that are about 40-50% higher,
> but still "within the ballpark":
>
>
>
> patch: 10.09
>
> scarlet: 9.44
>
> emperor: 8.24
>
> bright: 6.95
>
>
>
> ----------------
>
>
>
> Let's go step by step through the score for one particular collocate of
> "purple" -- "bright":
>
>
>
> N (corpus size) = 100,000,000
>
> f (purple) = 1262
>
> f (bright) = 5277
>
> f (purple, bright) = 9
>
> S (span size) = 6
>
>
>
> Using my calculation, one gets:
>
>
>
> log10 ( (100,000,000 * 9 ) / (1262 * 5277 * 6 ) ) / log (2)  = [ 4.49 ] ;
> close to BYU-BNC 4.44 and BNCweb 4.43
>
>
>
>
>
> With the MI formula from COLLOCATES given above:
>
> log2 ( ( N^(s-1) * f (x y) ) / ( f (x) * f (y) ) )
>
>
>
> on the other hand, one gets:
>
>
>
> log2 ( ( 100,000,000 ^ (6-1) * 9 ) / ( 1262 * 5277 ) ), or [ 113 ] , which
> is way off BYU-BNC and BNCweb and Sketch Engine. The problem here seems to
> be [ N ^ (span - 1) ] , which yields a huge numerator and the incorrect (??)
> MI score.
>
>
>
> Maybe I'm missing something obvious -- stats isn't my strong suit. But the
> fact that BYU-BNC and BNCweb agree so well (and the BNCweb people do know
> the formulas backwards and forwards), suggests that our formula is correct.
>
>
>
> One other question, I guess, is why Sketch Engine gives scores that are
> 40-50% off what is going on with BNCweb and BYU-BNC. I'm not saying that one
> is wrong and the other is right, but it's a bit disconcerting that the
> scores are not more similar. Maybe everyone could "cough up" their formulas,
> and we could see what's going on.
>
> MD
>
> ============================================
> Mark Davies
> Professor of (Corpus) Linguistics
> Brigham Young University
> (phone) 801-422-9168 / (fax) 801-422-0906
> Web: http://davies-linguistics.byu.edu
>
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
> ============================================
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>



-- 
Alex Clark
