[Corpora-List] Surprisingly large MI scores
Stefan Evert
stefan.evert at uos.de
Tue Sep 29 08:44:12 UTC 2009
> Michael B. gave the MI formula from COLLOCATES as:
>
> MI = log2 ( ( N^(s-1) * f (x y) ) / ( f (x) * f (y) ) )
Not really. What Michael gave was (shown here in a more explicit
notation)
MI = log2 ( ( N^(s-1) * f (x_1 x_2 ... x_s) ) / ( f (x_1) * f (x_2)
* ... * f (x_s) ) )
which is the (AFAIC correct) MI equation for n-grams (where n = s),
i.e. for a combination of s consecutive words.
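The factor N^(s-1) can be derived in a few steps, assuming that probabilities are estimated by relative frequencies p = f / N (an assumption made here for illustration; it ignores the small difference between the number of tokens and the number of n-gram positions in the corpus):

```latex
\begin{align*}
\mathrm{MI}
  &= \log_2 \frac{p(x_1 \dots x_s)}{p(x_1) \cdots p(x_s)} \\
  &= \log_2 \frac{f(x_1 \dots x_s) / N}{\prod_{i=1}^{s} \bigl( f(x_i) / N \bigr)} \\
  &= \log_2 \frac{N^{s-1} \, f(x_1 \dots x_s)}{f(x_1) \cdots f(x_s)}
\end{align*}
```

The numerator contributes one factor of 1/N and the denominator s of them, which leaves N^(s-1) on top.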
> I use (http://corpus.byu.edu):
>
> MI = log10 ( ( N * f (x,y) ) / ( f (x) * f (y) * S ) ) / log(2)
> (divide by log(2), since LOG in SQL Server is base 10)
>
> where N = corpus size and S = span size.
This is a reasonable approximation of MI scores for surface
collocations with a span size of S, i.e. combinations of _two_ lexemes
which co-occur within a distance of at most S words.
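A rough sketch of this approximation in Python, assuming that the expected co-occurrence frequency under independence is estimated as f(x) * f(y) * S / N (each of the f(x) occurrences of x offers S window positions, and each position holds y with probability f(y) / N). The function name `mi_span` and the counts are made up for illustration and are not taken from any of the tools discussed here.

```python
import math

def mi_span(n, f_xy, f_x, f_y, span):
    """Approximate MI for two lexemes co-occurring within `span` tokens.

    n     -- corpus size in tokens
    f_xy  -- observed co-occurrence frequency within the window
    f_x   -- frequency of the first lexeme
    f_y   -- frequency of the second lexeme
    span  -- total window size S (e.g. 1 for a 0L/1R span)
    """
    # Expected co-occurrences if x and y were distributed independently
    # (assumption: simple approximation, ignores window overlap and edges).
    expected = f_x * f_y * span / n
    return math.log2(f_xy / expected)

# Hypothetical counts, chosen only to show the effect of the span size.
score_adjacent = mi_span(1_000_000, 100, 1_000, 2_000, span=1)
score_window = mi_span(1_000_000, 100, 1_000, 2_000, span=4)
```

With the same observed frequency, widening the window from S=1 to S=4 lowers the score by log2(4) = 2, because four times as many co-occurrences are expected by chance.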
> Brett R. gives:
>
> MI = log2 ( ( N * f (x y) ) / ( f (x) * f (y) ) ) ( where is
> the span ?)
This is the MI score for adjacent bigrams, which is compatible with
both formulas above: in Michael's version, you have to set s=2 (for a
bigram); in Mark's version (the corpus.byu.edu formula), you have a
window size of S=1 (for a 0L/1R span).
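Spelled out, the two substitutions reduce to the same bigram formula (writing f(x y) for the co-occurrence frequency in both cases):

```latex
\begin{align*}
s = 2: &\quad \log_2 \frac{N^{2-1} \, f(x\,y)}{f(x) \, f(y)}
        = \log_2 \frac{N \, f(x\,y)}{f(x) \, f(y)} \\
S = 1: &\quad \log_2 \frac{N \, f(x\,y)}{f(x) \, f(y) \cdot 1}
        = \log_2 \frac{N \, f(x\,y)}{f(x) \, f(y)}
\end{align*}
```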
> This is apparently the same or quite similar to what is used for
> BNCweb.
Yes, we use the correct mathematical model for surface collocations,
which puts much more strain on the SQL database, but is usually close
to your approximation.
> One other question, I guess, is why Sketch Engine gives scores that
> are 40-50% off what is going on with BNCweb and BYU-BNC. I'm not
> saying that one is wrong and the other is right, but it's a bit
> disconcerting that the scores are not more similar. Maybe everyone
> could "cough up" their formulas, and we could see what's going on.
Are these really MI scores? At least by default, the Sketch Engine
calculates something different, which Adam calls a "salience score".
Cheers,
Stefan