<html dir="ltr"><head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style title="owaParaStyle"><!--P {
MARGIN-TOP: 0px; MARGIN-BOTTOM: 0px
}
--></style>
</head>
<body ocsi="x">
<p>Michael B. gave the MI formula from COLLOCATES as:<br>
<br>
MI = log2 ( ( N<font style="BACKGROUND-COLOR: #ffff00">^(s-1)</font> * f (x y) ) / ( f (x) * f (y) ) )<br>
<br>
Brett R. gives:<br>
<br>
MI = log2 ( ( N * f (x y) ) / ( f (x) * f (y) ) ) <font style="BACKGROUND-COLOR: #ffff00">
(where is the span?)</font><br>
</p>
<p><font face="tahoma"></font><br>
I use (<a href="http://corpus.byu.edu">http://corpus.byu.edu</a>):</p>
<p><br>
MI = log10 ( ( N * f (x,y) ) / ( f (x) * f (y) <font style="BACKGROUND-COLOR: #ffff00">
* S</font> ) ) / <font style="BACKGROUND-COLOR: #ffffff">log10 (2)<br>
(dividing by log10 (2) converts the base-10 logarithm to base 2; in SQL Server the base-10 function is LOG10, while LOG is the natural logarithm)</font></p>
<p><font face="tahoma"></font><font face="tahoma"></font> </p>
<p><font face="tahoma">where N = corpus size and S = span size.</font></p>
<p><font face="tahoma"></font> </p>
<p><font face="tahoma">This appears to be the same as, or very similar to, the formula used by BNCweb. The following are the MI scores from BNCweb and BYU-BNC (<a href="http://corpus.byu.edu/bnc">http://corpus.byu.edu/bnc</a>) for collocates of "purple" (span = 3L / 3R):</font></p>
<p><font face="tahoma"></font> </p>
<table>
<tr><th align="left">collocate</th><th align="left">BNCweb</th><th align="left">BYU-BNC</th></tr>
<tr><td>patch</td><td>7.65</td><td>7.29</td></tr>
<tr><td>scarlet</td><td>6.16</td><td>6.07</td></tr>
<tr><td>emperor</td><td>5.75</td><td>5.40</td></tr>
<tr><td>bright</td><td>4.43</td><td>4.44</td></tr>
</table>
<p><font face="tahoma"></font> </p>
<p><font face="tahoma">Strangely enough, Sketch Engine gives scores (for the same corpus (BNC), node word (purple), span (3L / 3R), and collocates) that are about 40-50% higher, but still "within the ballpark":</font></p>
<p> </p>
<p>patch: 10.09</p>
<p>scarlet: 9.44</p>
<p><font face="tahoma">emperor: 8.24</font></p>
<p><font face="tahoma">bright: 6.95</font><br>
</p>
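<p>As a quick check on the "about 40-50%" estimate, here is a minimal sketch that computes, for each collocate, how much higher the Sketch Engine score is than the BNCweb and BYU-BNC scores. It uses only the figures quoted above; the variable names are my own.</p>
<pre>
```python
# MI scores quoted in this message: (BNCweb, BYU-BNC, Sketch Engine).
scores = {
    "patch": (7.65, 7.29, 10.09),
    "scarlet": (6.16, 6.07, 9.44),
    "emperor": (5.75, 5.40, 8.24),
    "bright": (4.43, 4.44, 6.95),
}

# Percentage by which Sketch Engine exceeds each of the other two tools.
pct_higher = {}
for word, (bncweb, byu, sketch) in scores.items():
    vs_bncweb = 100 * (sketch / bncweb - 1)
    vs_byu = 100 * (sketch / byu - 1)
    pct_higher[word] = (vs_bncweb, vs_byu)
    print(f"{word}: {vs_bncweb:.0f}% above BNCweb, {vs_byu:.0f}% above BYU-BNC")
```
</pre>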
<p> </p>
<p><font face="tahoma">----------------</font></p>
<p><font face="tahoma"></font> </p>
<p>Let's go step by step through the score for one particular collocate of "purple" -- "bright":</p>
<p><font face="tahoma"></font><font face="tahoma"></font> </p>
<p><font face="tahoma">N (corpus size) = 100,000,000</font></p>
<p><font face="tahoma">f (purple) = 1262</font></p>
<p><font face="tahoma">f (bright) = 5277</font></p>
<p><font face="tahoma">f (purple, bright) = 9</font></p>
<p><font face="tahoma">S (span size) = 6</font></p>
<p><font face="tahoma"></font> </p>
<p><font face="tahoma">Using my calculation, one gets:</font></p>
<p><font face="tahoma"></font> </p>
<p><font face="tahoma">log10 ( ( 100,000,000 * 9 ) / ( 1262 * 5277 * 6 ) ) / log10 (2) = [ <font style="BACKGROUND-COLOR: #ff00ff">4.49</font> ], close to BYU-BNC (4.44) and BNCweb (4.43)</font></p>
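<p>The calculation above can be reproduced with a short sketch. This is only an illustration of the formula as written in this message, not the actual BYU-BNC code (which runs in SQL Server); the function name <code>mi_span</code> and its parameter names are hypothetical.</p>
<pre>
```python
import math


def mi_span(n, f_x, f_y, f_xy, span):
    """MI with the span size in the denominator, computed via log10."""
    ratio = (n * f_xy) / (f_x * f_y * span)
    # Dividing log10(ratio) by log10(2) converts the result to base 2.
    return math.log10(ratio) / math.log10(2)


# Counts for node "purple" and collocate "bright", as quoted in this message.
mi_bright = mi_span(n=100_000_000, f_x=1262, f_y=5277, f_xy=9, span=6)
print(round(mi_bright, 2))  # 4.49
```
</pre>
<p>The same value comes out of a direct base-2 logarithm of the ratio, so the choice of log10 here is only a matter of which function is available.</p>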
<p><font face="tahoma"></font> </p>
<p><font face="tahoma"></font> </p>
<p><font face="tahoma">With the MI formula from COLLOCATES given above:<br>
</font></p>
<p><font face="tahoma"><font style="BACKGROUND-COLOR: #ccffff"><font face="tahoma"><font style="BACKGROUND-COLOR: #ccffff">log2</font></font> ( ( N<font style="BACKGROUND-COLOR: #ffff00">^(s-1)</font> * f (x y) ) / ( f (x) * f (y) ) )</font></font></p>
<p><font face="tahoma"></font> </p>
<p><font face="tahoma">on the other hand, one gets:</font></p>
<p><font face="tahoma"><font face="tahoma"></font></font> </p>
<p><font face="tahoma">log2 ( ( 100,000,000 ^ (6-1) * 9 ) / ( 1262 * 5277 ) ), or [ <font style="BACKGROUND-COLOR: #ff00ff">113</font> ], which is way off BYU-BNC, BNCweb, and Sketch Engine. The problem here seems to be [ <font style="BACKGROUND-COLOR: #ffff00">N ^ (span - 1)</font> ]: with N = 100,000,000 and a span of 6 it equals 10^40, which yields a huge numerator and the incorrect (??) MI score.</font></p>
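<p>To put the three formulas quoted in this message side by side, here is a rough sketch that applies each one to the same counts for "purple" / "bright". The function names are hypothetical labels of mine, and the formulas are taken exactly as written in this thread, so any misquotation would carry over.</p>
<pre>
```python
import math

# Counts for node "purple" and collocate "bright", as quoted in this message.
N, F_X, F_Y, F_XY, SPAN = 100_000_000, 1262, 5277, 9, 6


def mi_collocates(n, f_x, f_y, f_xy, span):
    """Formula attributed to COLLOCATES: N raised to the power (span - 1)."""
    return math.log2((n ** (span - 1) * f_xy) / (f_x * f_y))


def mi_no_span(n, f_x, f_y, f_xy):
    """Formula given by Brett R.: no span term at all."""
    return math.log2((n * f_xy) / (f_x * f_y))


def mi_span_divisor(n, f_x, f_y, f_xy, span):
    """Formula used at corpus.byu.edu: span size in the denominator."""
    return math.log2((n * f_xy) / (f_x * f_y * span))


collocates_score = mi_collocates(N, F_X, F_Y, F_XY, SPAN)
no_span_score = mi_no_span(N, F_X, F_Y, F_XY)
span_divisor_score = mi_span_divisor(N, F_X, F_Y, F_XY, SPAN)

print(round(collocates_score, 2))    # 113.38
print(round(no_span_score, 2))       # 7.08
print(round(span_divisor_score, 2))  # 4.49
```
</pre>
<p>For what it is worth, the version without any span term gives about 7.08 for "bright", which is closer to the Sketch Engine score of 6.95 than to the BNCweb score of 4.43. That by itself does not show which formula Sketch Engine actually uses.</p>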
<p><font face="tahoma"><font face="tahoma"></font> </p>
<p>Maybe I'm missing something obvious -- stats isn't my strong suit. But the fact that BYU-BNC and BNCweb agree so closely (and the BNCweb people do know the formulas backwards and forwards) suggests that our formula is correct.
</p>
<p> </p>
<p>One other question, I guess, is why Sketch Engine gives scores that are 40-50% higher than those from BNCweb and BYU-BNC. I'm not saying that one is wrong and the other is right, but it's a bit disconcerting that the scores are not more similar. Maybe
everyone could "cough up" their formulas, and we could see what's going on.<br>
</p>
</font>
<p>MD</p>
<p><font face="tahoma"></font><br>
============================================<br>
Mark Davies<br>
Professor of (Corpus) Linguistics<br>
Brigham Young University<br>
(phone) 801-422-9168 / (fax) 801-422-0906<br>
Web: http://davies-linguistics.byu.edu<br>
<br>
** Corpus design and use // Linguistic databases **<br>
** Historical linguistics // Language variation **<br>
** English, Spanish, and Portuguese **<br>
============================================<br>
<br>
</p>
</body>
</html>