[Corpora-List] Surprisingly large MI scores
Mark Davies
Mark_Davies at byu.edu
Mon Sep 28 21:47:31 UTC 2009
Michael B. gave the MI formula from COLLOCATES as:
MI = log2 ( ( N^(s-1) * f (x y) ) / ( f (x) * f (y) ) )
Brett R. gives:
MI = log2 ( ( N * f (x y) ) / ( f (x) * f (y) ) ) (where is the span?)
I use (http://corpus.byu.edu):
MI = log10 ( ( N * f (x,y) ) / ( f (x) * f (y) * S ) ) / log10 (2)
(dividing by log10(2) converts the base-10 logarithm that I compute in SQL Server to base 2)
where N = corpus size and S = span size.
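The change of base used in that formula can be checked with a minimal sketch. The helper name log2_via_log10 and the test value are hypothetical, chosen only for illustration; the one thing the sketch relies on is that both logarithms in the ratio share the same base.

```python
import math

def log2_via_log10(x):
    """Base-2 logarithm built from base-10 logarithms (change of base)."""
    return math.log10(x) / math.log10(2)

x = 22.5  # arbitrary illustrative value, not a corpus count

same_base = log2_via_log10(x)               # both logs are base 10
direct = math.log2(x)                       # reference value
mixed_base = math.log10(x) / math.log(2)    # base 10 over natural log: not log2

print(same_base, direct, mixed_base)
```

If the two logarithms come from functions with different bases, the result is no longer a base-2 score, so it is worth confirming which base a given database or spreadsheet function uses.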
This is apparently the same or quite similar to what is used for BNCweb. The following are the MI scores from BNCweb and BYU-BNC (http://corpus.byu.edu/bnc) for collocates of "purple" (span = 3L / 3R):
collocate   BNCweb   BYU-BNC
---------   ------   -------
patch:      7.65     7.29
scarlet:    6.16     6.07
emperor:    5.75     5.40
bright:     4.43     4.44
Strangely enough, Sketch Engine gives scores (for the same corpus (BNC), node word (purple), span (3L / 3R), and collocates) that are about 40-50% higher, but still "within the ballpark":
patch: 10.09
scarlet: 9.44
emperor: 8.24
bright: 6.95
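As a rough way to compare the three sets of scores quoted above, the sketch below computes both the percentage gap and the plain difference between Sketch Engine and the other two interfaces. The variable names are mine. Because MI is a logarithm, a constant factor inside the log would show up as a constant difference rather than a constant percentage, so both views are printed.

```python
# MI scores quoted in this message: (BNCweb, BYU-BNC, Sketch Engine)
scores = {
    "patch":   (7.65, 7.29, 10.09),
    "scarlet": (6.16, 6.07, 9.44),
    "emperor": (5.75, 5.40, 8.24),
    "bright":  (4.43, 4.44, 6.95),
}

comparison = {}
for word, (bncweb, byu, sketch) in scores.items():
    comparison[word] = {
        "pct_vs_bncweb": (sketch / bncweb - 1) * 100,  # relative gap in percent
        "pct_vs_byu": (sketch / byu - 1) * 100,
        "diff_vs_bncweb": sketch - bncweb,             # absolute gap in MI units
        "diff_vs_byu": sketch - byu,
    }

for word, c in comparison.items():
    print(f"{word:8s} {c['pct_vs_bncweb']:6.1f}% {c['pct_vs_byu']:6.1f}% "
          f"{c['diff_vs_bncweb']:5.2f} {c['diff_vs_byu']:5.2f}")
```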
----------------
Let's go step by step through the score for one particular collocate of "purple" -- "bright":
N (corpus size) = 100,000,000
f (purple) = 1262
f (bright) = 5277
f (purple, bright) = 9
S (span size) = 6
Using my calculation, one gets:
log10 ( ( 100,000,000 * 9 ) / ( 1262 * 5277 * 6 ) ) / log10 (2) = [ 4.49 ] ; close to BYU-BNC 4.44 and BNCweb 4.43
With the MI formula from COLLOCATES given above:
log2 ( ( N^(s-1) * f (x y) ) / ( f (x) * f (y) ) )
on the other hand, one gets:
log2 ( ( 100,000,000 ^ (6-1) * 9 ) / ( 1262 * 5277 ) ), or [ 113 ] , which is way off BYU-BNC and BNCweb and Sketch Engine. The problem here seems to be [ N ^ (span - 1) ] , which yields a huge numerator and the incorrect (??) MI score.
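The arithmetic for "bright" can be reproduced with a short sketch that evaluates the three formulas quoted in this message on the same counts. The variable names are mine, and the sketch makes no claim about what any of the tools actually implement internally.

```python
import math

# Counts for node word "purple" and collocate "bright", as given above
N = 100_000_000      # corpus size
f_purple = 1262
f_bright = 5277
f_pair = 9           # co-occurrences within the span
S = 6                # span size (3L + 3R)

# Formula with the span in the denominator (the one I use)
mi_span = math.log2(N * f_pair / (f_purple * f_bright * S))

# Formula with no span term (as given by Brett R.)
mi_no_span = math.log2(N * f_pair / (f_purple * f_bright))

# Formula with N^(S-1) in the numerator (as given for COLLOCATES),
# evaluated in log space to avoid forming the huge numerator directly
mi_collocates = ((S - 1) * math.log2(N)
                 + math.log2(f_pair)
                 - math.log2(f_purple * f_bright))

print(f"span in denominator: {mi_span:.2f}")
print(f"no span term:        {mi_no_span:.2f}")
print(f"N^(S-1) numerator:   {mi_collocates:.2f}")
```

On these counts the span-free variant comes out near 7.08, which differs from the span-adjusted score by exactly log2(6). That happens to be in the neighbourhood of the Sketch Engine score of 6.95 for "bright", though this by itself does not establish which formula Sketch Engine uses.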
Maybe I'm missing something obvious -- stats isn't my strong suit. But the fact that BYU-BNC and BNCweb agree so well (and the BNCweb people do know the formulas backwards and forwards) suggests that our formula is correct.
One other question, I guess, is why Sketch Engine gives scores that are 40-50% higher than those from BNCweb and BYU-BNC. I'm not saying that one is wrong and the other is right, but it's a bit disconcerting that the scores are not more similar. Maybe everyone could "cough up" their formulas, and we could see what's going on.
MD
============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: http://davies-linguistics.byu.edu
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================