<html dir="ltr"><head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<style title="owaParaStyle"><!--P {

        MARGIN-TOP: 0px; MARGIN-BOTTOM: 0px

}

--></style>

</head>

<body ocsi="x">

<p>Michael B. gave the MI formula from COLLOCATES as:<br>

<br>

MI = log2 ( ( N<font style="BACKGROUND-COLOR: #ffff00">^(s-1)</font> * f (x y) ) / ( f (x) * f (y) ) )<br>

<br>

Brett R. gives:<br>

<br>

MI = log2 ( ( N * f (x y) ) / ( f (x) * f (y) ) )       <font style="BACKGROUND-COLOR: #ffff00">(where is the span?)</font><br>

</p>

<p><font face="tahoma"></font><br>

I use (<a href="http://corpus.byu.edu">http://corpus.byu.edu</a>):</p>

<font face="tahoma"></font>

<p><br>

MI = log10 ( ( N * f (x,y) ) / ( f (x) * f (y) <font style="BACKGROUND-COLOR: #ffff00">* S</font> ) ) / <font style="BACKGROUND-COLOR: #ffffff">log10(2)<br>

(divide by log10(2) to convert the base-10 logarithm to base 2; in SQL Server, LOG10 is base 10, while LOG is the natural logarithm)</font></p>

<p><font face="tahoma"></font><font face="tahoma"></font> </p>

<p><font face="tahoma">where N = corpus size and S = span size.</font></p>
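<p>The division in the formula above is just a change of base. A minimal Python sketch of that step, where the function name and the sample value are illustrative assumptions and not taken from any of the tools discussed:</p>
<pre>
```python
import math


def log2_via_log10(x):
    """Base-2 logarithm computed from base-10 logarithms (change of base)."""
    return math.log10(x) / math.log10(2)


value = 22.5  # arbitrary illustrative ratio, not a corpus figure
print(round(log2_via_log10(value), 4))
print(round(math.log2(value), 4))  # same value, computed directly
```
</pre>
<p>Both logarithms in the quotient have to share a base; mixing a base-10 numerator with a natural-log denominator would give a different number.</p>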

<p><font face="tahoma"></font> </p>

<p><font face="tahoma">This appears to be the same as, or very similar to, the formula used for BNCweb. The following are the MI scores from BNCweb and BYU-BNC (<a href="http://corpus.byu.edu/bnc">http://corpus.byu.edu/bnc</a>) for collocates of "purple" (span = 3L / 3R):</font></p>

<p><font face="tahoma"></font> </p>

<table>
<tr><th align="left">collocate</th><th align="left">BNCweb</th><th align="left">BYU-BNC</th></tr>
<tr><td>patch</td><td>7.65</td><td>7.29</td></tr>
<tr><td>scarlet</td><td>6.16</td><td>6.07</td></tr>
<tr><td>emperor</td><td>5.75</td><td>5.40</td></tr>
<tr><td>bright</td><td>4.43</td><td>4.44</td></tr>
</table>

<p><font face="tahoma"></font> </p>

<p><font face="tahoma">Strangely enough, Sketch Engine gives scores (for the same corpus (BNC), node word (purple), span (3L / 3R), and collocates) that are about 40-50% higher, but still "within the ballpark":</font></p>

<p> </p>

<p>patch: 10.09</p>

<p>scarlet: 9.44</p>

<p><font face="tahoma">emperor: 8.24</font></p>

<p><font face="tahoma">bright: 6.95</font><br>

</p>
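<p>To put the gap in numbers, here is a rough Python sketch that compares the Sketch Engine scores with the BNCweb scores quoted above. The dictionary names are my own; the figures are the ones listed in this message.</p>
<pre>
```python
# Scores as quoted in this message.
bncweb = {"patch": 7.65, "scarlet": 6.16, "emperor": 5.75, "bright": 4.43}
sketch_engine = {"patch": 10.09, "scarlet": 9.44, "emperor": 8.24, "bright": 6.95}

# Ratio and absolute difference for each collocate.
ratios = {w: sketch_engine[w] / bncweb[w] for w in bncweb}
gaps = {w: sketch_engine[w] - bncweb[w] for w in bncweb}

for word in bncweb:
    print(word, round(ratios[word], 2), round(gaps[word], 2))
```
</pre>
<p>The ratios come out between about 1.32 and 1.57, and the absolute gaps between about 2.4 and 3.3 points.</p>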

<p> </p>

<p><font face="tahoma">----------------</font></p>

<p><font face="tahoma"></font> </p>

<p>Let's go step by step through the score for one particular collocate of "purple" -- "bright":</p>

<p><font face="tahoma"></font><font face="tahoma"></font> </p>

<p><font face="tahoma">N (corpus size) = 100,000,000</font></p>

<p><font face="tahoma">f (purple) = 1262</font></p>

<p><font face="tahoma">f (bright) = 5277</font></p>

<p><font face="tahoma">f (purple, bright) = 9</font></p>

<p><font face="tahoma">S (span size) = 6</font></p>

<p><font face="tahoma"></font> </p>

<p><font face="tahoma">Using my calculation, one gets:</font></p>

<p><font face="tahoma"></font> </p>

<p><font face="tahoma">log10 ( ( 100,000,000 * 9 ) / ( 1262 * 5277 * 6 ) ) / log10 (2)  = [ <font style="BACKGROUND-COLOR: #ff00ff">4.49</font> ] ; close to BYU-BNC 4.44 and BNCweb 4.43</font></p>
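<p>The same calculation as a short Python sketch. The function and parameter names are assumptions of mine for illustration; this is not the SQL that runs on the site.</p>
<pre>
```python
import math


def mi_with_span(n, f_x, f_y, f_xy, span):
    """MI = log2( (N * f(x,y)) / (f(x) * f(y) * S) )."""
    return math.log2((n * f_xy) / (f_x * f_y * span))


# Figures quoted above for "purple" + "bright" in the BNC.
score = mi_with_span(n=100_000_000, f_x=1262, f_y=5277, f_xy=9, span=6)
print(round(score, 2))  # 4.49
```
</pre>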

<p><font face="tahoma"></font> </p>

<p><font face="tahoma"></font> </p>

<p><font face="tahoma">With the MI formula from COLLOCATES given above:<br>

</font></p>

<p><font face="tahoma"><font style="BACKGROUND-COLOR: #ccffff"><font face="tahoma"><font style="BACKGROUND-COLOR: #ccffff">log2</font></font> ( ( N<font style="BACKGROUND-COLOR: #ffff00">^(s-1)</font> * f (x y) ) / ( f (x) * f (y) ) )</font></font></p>

<p><font face="tahoma"></font> </p>

<p><font face="tahoma">on the other hand, one gets:</font></p>

<p><font face="tahoma"><font face="tahoma"></font></font> </p>

<p><font face="tahoma">log2 ( ( 100,000,000 ^ (6-1) * 9 ) / ( 1262 * 5277 ) ) = [ <font style="BACKGROUND-COLOR: #ff00ff">113</font> ], which is far from the BYU-BNC, BNCweb, and Sketch Engine scores. The problem seems to be the [ <font style="BACKGROUND-COLOR: #ffff00">N ^ (span - 1)</font> ] term, which yields a huge numerator and what looks like an incorrect MI score.</font></p>
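<p>For anyone who wants to check the arithmetic, here is a Python sketch of the COLLOCATES formula as quoted in this thread, expanded into a sum of logs so the large power never has to be formed. The function name is hypothetical, and this is not the COLLOCATES source code.</p>
<pre>
```python
import math


def mi_collocates_as_quoted(n, f_x, f_y, f_xy, s):
    """log2( (N^(s-1) * f(x y)) / (f(x) * f(y)) ), written as a sum of logs."""
    return (s - 1) * math.log2(n) + math.log2(f_xy) - math.log2(f_x) - math.log2(f_y)


# s = 6: the 3L / 3R span plugged in as s, as in the calculation above.
score_s6 = mi_collocates_as_quoted(100_000_000, 1262, 5277, 9, s=6)
print(round(score_s6))  # 113

# s = 2: the N^(s-1) term becomes plain N, i.e. the formula with no span term.
score_s2 = mi_collocates_as_quoted(100_000_000, 1262, 5277, 9, s=2)
print(round(score_s2, 2))  # 7.08
```
</pre>
<p>With s = 2 the expression reduces to the span-free formula log2 ( ( N * f (x y) ) / ( f (x) * f (y) ) ). Whether s in the original formula was meant as something other than the window span (for example, the number of words in the sequence) is only a guess on my part and is not stated anywhere in this thread.</p>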

<p><font face="tahoma"><font face="tahoma"></font> </p>

<p>Maybe I'm missing something obvious -- stats isn't my strong suit. But the fact that BYU-BNC and BNCweb agree so well (and the BNCweb people do know the formulas backwards and forwards) suggests that our formula is correct.

</p>

<p> </p>

<p>One other question, I guess, is why Sketch Engine gives scores that are 40-50% higher than those from BNCweb and BYU-BNC. I'm not saying that one is wrong and the other is right, but it's a bit disconcerting that the scores are not more similar. Maybe everyone could "cough up" their formulas, and we could see what's going on.<br>

</p>

</font>

<p>MD</p>

<p><font face="tahoma"></font><br>

============================================<br>

Mark Davies<br>

Professor of (Corpus) Linguistics<br>

Brigham Young University<br>

(phone) 801-422-9168 / (fax) 801-422-0906<br>

Web: http://davies-linguistics.byu.edu<br>

 <br>

** Corpus design and use // Linguistic databases **<br>

** Historical linguistics // Language variation **<br>

** English, Spanish, and Portuguese **<br>

============================================<br>

<br>

</p>

</body>

</html>