[Corpora-List] Surprisingly large MI scores

Stefan Evert stefan.evert at uos.de
Tue Sep 29 08:44:12 UTC 2009


> Michael B. gave the MI formula from COLLOCATES as:
>
> MI = log2 ( ( N^(s-1) * f (x y) ) / ( f (x) * f (y) ) )

Not really.  What Michael gave was (shown here in a more explicit  
notation)

	MI = log2 ( ( N^(s-1) * f (x_1 x_2 ... x_s) ) / ( f (x_1) * f (x_2) * ... * f (x_s) ) )

which is the (AFAIC correct) MI equation for n-grams (where n = s),  
i.e. for a combination of s consecutive words.
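
As a minimal sketch of the n-gram formula above (the function name,
argument names and example counts are made up for illustration, they
are not taken from COLLOCATES), one could compute it in log space so
that N^(s-1) never has to be formed explicitly:

```python
import math

def ngram_mi(N, f_ngram, f_words):
    """MI for an n-gram of s = len(f_words) consecutive words.

    N       -- corpus size in tokens
    f_ngram -- frequency of the whole n-gram, f(x_1 x_2 ... x_s)
    f_words -- list of the single-word frequencies f(x_1), ..., f(x_s)
    """
    s = len(f_words)
    # log2( N^(s-1) * f_ngram / (f(x_1) * ... * f(x_s)) ), term by term
    return ((s - 1) * math.log2(N)
            + math.log2(f_ngram)
            - sum(math.log2(f) for f in f_words))

# Hypothetical counts: a bigram and a trigram in a 1024-token corpus
print(ngram_mi(1024, 4, [16, 16]))    # bigram,  s = 2
print(ngram_mi(1024, 2, [8, 8, 8]))   # trigram, s = 3
```

With s = 2 the exponent on N is 1, so the score is the usual bigram MI.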

> I use (http://corpus.byu.edu):
>
> MI = log10 ( ( N * f (x,y) ) / ( f (x) * f (y) * S ) ) / log(2)
> (divide by log(2), since LOG in SQL Server is base 10)
>
> where N = corpus size and S = span size.

This is a reasonable approximation of MI scores for surface  
collocations with a span size of S, i.e. combinations of _two_ lexemes  
which co-occur within a distance of at most S words.
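
A small sketch of this span-based approximation, assuming f_xy is the
number of co-occurrences of the two lexemes within the span (the
function and argument names are my own, not those of the BYU
interface):

```python
import math

def span_mi(N, f_xy, f_x, f_y, S):
    """Approximate MI for surface collocations with span size S.

    N    -- corpus size in tokens
    f_xy -- co-occurrence frequency of x and y within the span
    f_x  -- frequency of x
    f_y  -- frequency of y
    S    -- span size in words
    """
    ratio = (N * f_xy) / (f_x * f_y * S)
    # Same value as log10(ratio) / log10(2), which is what one writes
    # when only a base-10 logarithm is available.
    return math.log2(ratio)

# Hypothetical counts: 8 co-occurrences within a span of 4 words
print(span_mi(1024, 8, 16, 16, 4))
```

Dividing by S lowers the score by log2(S) compared with the plain
bigram formula, since a larger window gives more chances of
co-occurrence.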

> Brett R. gives:
>
> MI = log2 ( ( N * f (x y) ) / ( f (x) * f (y) ) )       ( where is  
> the span ?)

This is the MI score for adjacent bigrams, which is compatible with
both formulas above: in Michael's version, you set s=2 (for a
bigram); in Mark's version, you set a window size of S=1 (for a 0L/1R
span).
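
A quick numeric check of this equivalence, with invented counts (the
variable names simply label whose formula is being evaluated):

```python
import math

# Hypothetical counts for one adjacent word pair
N, f_xy, f_x, f_y = 1_000_000, 50, 2_000, 500

s = 2  # n-gram length in Michael's formula
S = 1  # window size in Mark's formula

michael = math.log2(N ** (s - 1) * f_xy / (f_x * f_y))
mark = math.log2(N * f_xy / (f_x * f_y * S))
brett = math.log2(N * f_xy / (f_x * f_y))

# All three print the same value, log2(50)
print(michael, mark, brett)
```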

> This is apparently the same or quite similar to what is used for  
> BNCweb.

Yes, we use the correct mathematical model for surface collocations,  
which puts much more strain on the SQL database, but is usually close  
to your approximation.

> One other question, I guess, is why Sketch Engine gives scores that  
> are 40-50% off what is going on with BNCweb and BYU-BNC. I'm not  
> saying that one is wrong and the other is right, but it's a bit  
> disconcerting that the scores are not more similar. Maybe everyone  
> could "cough up" their formulas, and we could see what's going on.

Are these really MI scores?  At least by default, the Sketch Engine  
calculates something different, which Adam calls a "salience score".

Cheers,
Stefan

