[Corpora-List] Surprisingly large MI scores

Mon Sep 28 21:47:31 UTC 2009

Michael B. gave the MI formula from COLLOCATES as:

MI = log2 ( ( N^(s-1) * f (x y) ) / ( f (x) * f (y) ) )

Brett R. gives:

MI = log2 ( ( N * f (x y) ) / ( f (x) * f (y) ) )       (but where is the span?)

At http://corpus.byu.edu I use:

MI = log10 ( ( N * f (x,y) ) / ( f (x) * f (y) * S ) ) / log10 (2)
(dividing by log10 (2) converts the base-10 logarithm, as computed in SQL Server, to base 2)

where N = corpus size and S = span size.

This is apparently the same or quite similar to what is used for BNCweb. The following are the MI scores from BNCweb and BYU-BNC (http://corpus.byu.edu/bnc) for collocates of "purple" (span = 3L / 3R):

collocate   BNCweb   BYU-BNC
---------   ------   -------
patch       7.65     7.29
scarlet     6.16     6.07
emperor     5.75     5.40
bright      4.43     4.44

Strangely enough, Sketch Engine gives scores (for the same corpus (BNC), node word (purple), span (3L / 3R), and collocates) that are roughly 30-60% higher than the BNCweb scores, but still "within the ballpark":

collocate   Sketch Engine
---------   -------------
patch       10.09
scarlet      9.44
emperor      8.24
bright       6.95
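
As a rough check on how far apart the tools are, the gap can be computed directly from the scores quoted above. This is only a sketch: the dictionary and variable names are illustrative, and the numbers are the BNCweb and Sketch Engine scores as listed in this message, nothing more.

```python
# Scores copied from the two lists above (collocates of "purple", span 3L / 3R).
bncweb = {"patch": 7.65, "scarlet": 6.16, "emperor": 5.75, "bright": 4.43}
sketch_engine = {"patch": 10.09, "scarlet": 9.44, "emperor": 8.24, "bright": 6.95}

# For each collocate: absolute gap in MI points and relative gap in percent.
gaps = {}
for word, base in bncweb.items():
    diff = sketch_engine[word] - base
    gaps[word] = (diff, 100 * diff / base)

for word, (diff, pct) in gaps.items():
    print(f"{word:8s} +{diff:.2f} MI points ({pct:.0f}% higher)")
```

Looking at both the absolute and the relative gap is a deliberate choice: MI is a logarithm, so a constant factor inside the formula would show up as a roughly constant absolute offset rather than a constant percentage.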

----------------

Let's go step by step through the score for one particular collocate of "purple" -- "bright":

N (corpus size) = 100,000,000

f (purple) = 1262

f (bright) = 5277

f (purple, bright) = 9

S (span size) = 6

Using my calculation, one gets:

log10 ( (100,000,000 * 9) / (1262 * 5277 * 6) ) / log10 (2)  = [ 4.49 ], which is close to BYU-BNC's 4.44 and BNCweb's 4.43
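
The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch of the span-adjusted formula as stated in this message, not the actual SQL Server code behind corpus.byu.edu; the function name `mi_span_adjusted` and its parameter names are made up for illustration.

```python
import math

def mi_span_adjusted(n, f_x, f_y, f_xy, span):
    """MI = log2((N * f(x,y)) / (f(x) * f(y) * S)), computed via base-10 logs."""
    ratio = (n * f_xy) / (f_x * f_y * span)
    # Dividing a base-10 log by log10(2) converts it to base 2.
    return math.log10(ratio) / math.log10(2)

# Figures quoted above for "purple" + "bright" in the BNC.
score = mi_span_adjusted(n=100_000_000, f_x=1262, f_y=5277, f_xy=9, span=6)
print(round(score, 2))  # 4.49
```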

With the MI formula from COLLOCATES given above:

log2 ( ( N^(s-1) * f (x y) ) / ( f (x) * f (y) ) )

on the other hand, one gets:

log2 ( ( 100,000,000 ^ (6-1) * 9 ) / ( 1262 * 5277 ) ), or [ 113 ], which is far from the BYU-BNC, BNCweb, and Sketch Engine scores. The problem seems to be the [ N ^ (span - 1) ] term, which makes the numerator huge and yields an apparently incorrect MI score.
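
The same check for the COLLOCATES formula, read literally with s = 6, can be sketched as follows. The function name `mi_collocates_as_quoted` is hypothetical, and the code only evaluates the formula as it was quoted on the list; it says nothing about how COLLOCATES itself is implemented. The terms are summed as logarithms so that N^(s-1) never has to be formed explicitly.

```python
import math

def mi_collocates_as_quoted(n, f_x, f_y, f_xy, s):
    """log2((N^(s-1) * f(x y)) / (f(x) * f(y))), evaluated term by term in log space."""
    return ((s - 1) * math.log2(n)
            + math.log2(f_xy)
            - math.log2(f_x)
            - math.log2(f_y))

# Same figures as above, taking s to be the span size of 6.
score_s6 = mi_collocates_as_quoted(n=100_000_000, f_x=1262, f_y=5277, f_xy=9, s=6)
print(round(score_s6))  # 113
```

Note that with s = 2 the exponent on N becomes 1, so the expression is algebraically identical to the formula Brett R. gave; whether that is how s is meant to be read in COLLOCATES is not something this sketch can settle.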

Maybe I'm missing something obvious; stats isn't my strong suit. But the fact that BYU-BNC and BNCweb agree so well (and the BNCweb people do know the formulas backwards and forwards) suggests that our formula is correct.

One other question is why Sketch Engine gives scores that are roughly 30-60% higher than those from BNCweb and BYU-BNC. I'm not saying that one is wrong and the other is right, but it's a bit disconcerting that the scores are not more similar. Maybe everyone could "cough up" their formulas, and we could see what's going on.

MD

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: http://davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================
