[Corpora-List] Surprisingly large MI scores
Stefan Evert
stefan.evert at uos.de
Sun Sep 13 20:51:01 UTC 2009
Hi Linas,
thanks for bringing this topic up again. :-)
> I have a non-refereed blog post on this that includes graphs of the
> distribution of 2-grams in parsed text. The median is somewhere
> around MI=4 or 5 or so. Previous discussion on this mailing list
> concluded that much/most of the shape of these graphs can be
> obtained simply by picking out random pairs of words from a
> Zipf distribution. I have not carefully, formally verified this last
> hypothesis, but it seemed correct when last discussed.
In my own (informal and unsystematic) simulation experiments I
observed a much more skewed distribution, with few MI scores below 0,
but a considerable proportion of high scores (MI >= 10); I sent a plot
of this pattern to the list back then. The skewness and presence of
high scores is tied to the inclusion of low-frequency bigrams in the
data. If only bigrams with co-occurrence frequency > 5 are
considered, my highest MI scores were <= 4, and the distribution
looked much more symmetric.
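A minimal sketch of the kind of simulation I mean (all parameters
arbitrary, not the ones I actually used): draw word pairs
independently from a Zipfian unigram distribution, so any
co-occurrence is purely due to chance, and compute pointwise MI
for each pair type. The hapax pairs produce the extreme scores,
while filtering on f > 5 removes most of them:

```python
import math
import random
from collections import Counter

# Hypothetical parameters, chosen only for illustration.
V = 10_000     # vocabulary size (assumption)
ALPHA = 1.0    # Zipf exponent (assumption)
N = 200_000    # number of simulated bigram tokens (assumption)

random.seed(42)

# Zipfian unigram weights: p(rank r) proportional to 1 / r^ALPHA.
weights = [1.0 / (r ** ALPHA) for r in range(1, V + 1)]

# Draw the two slots independently -> co-occurrence by chance only.
w1 = random.choices(range(V), weights=weights, k=N)
w2 = random.choices(range(V), weights=weights, k=N)
pairs = Counter(zip(w1, w2))
f1 = Counter(w1)   # marginal frequency in first slot
f2 = Counter(w2)   # marginal frequency in second slot

def mi(pair, f):
    """Pointwise MI in bits: log2(O / E) with E = f1 * f2 / N."""
    a, b = pair
    expected = f1[a] * f2[b] / N
    return math.log2(f / expected)

scores = [mi(p, f) for p, f in pairs.items()]
scores_freq = [mi(p, f) for p, f in pairs.items() if f > 5]

# Low-frequency pairs dominate and push the distribution towards
# high MI; the f > 5 subset stays much more moderate.
print("all pairs:  max MI = %.1f" % max(scores))
print("f > 5 only: max MI = %.1f" % max(scores_freq))
```

Note that a hapax pair of two hapax words reaches MI = log2(N)
here, which is where the long right tail comes from.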
> Also note the difference between the first and second graph:
> the second shows words collocated next to each other, and
> has a median MI of 1 or 2, whereas the first graph looks at
> 2-grams which possibly have intervening words in between
> the two words. (e.g. determiners a, the; or modifiers and
> qualifiers e.g. some, many, green blue, etc.) This case is
> more interesting for things like minimal spanning-tree
> dependency parse techniques (i.e. finding a parse tree by
> maximizing the MI of linkages between word pairs in a
> sentence).
I did a similar experiment on the BNC, using pairs of nouns that co-
occur in the same sentence. The resulting distribution is markedly
different both from Linas' data and my simulation experiments (based
on chance co-occurrence and a Zipfian distribution):
[Attachment: MI_noun_noun_sentences_BNC.png (distribution of MI scores for noun-noun pairs co-occurring in the same BNC sentence): <http://listserv.linguistlist.org/pipermail/corpora/attachments/20090913/e20e8e64/attachment.png>]
As you can see, this distribution does not show the same spike at the
mode as the other distributions, and it's very strongly skewed towards
high MI scores. It is also remarkable that there is much less of a
frequency bias: the distribution for f > 5 is qualitatively similar
to the distribution for hapaxes (f = 1).
This experiment also shows that -- depending on how candidate data
were obtained -- large MI scores up to MI = 20 are not at all
uncommon. In this case, the "lenient" definition of co-occurrence
(within the same sentence) enables many word pairs to achieve quite
high co-occurrence frequency. If MI scores are calculated for larger
n-grams, the expected frequency can easily become so low that very
high MI scores are obtained even for n-grams that occur just once or
twice.
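The arithmetic behind this is easy to see with toy numbers (all of
them hypothetical): for an n-gram, the expected frequency under
independence divides by the corpus size once more for each
additional word, so even a single occurrence yields an enormous
observed/expected ratio.

```python
import math

# Toy numbers for illustration only (assumption: a 100M-token,
# roughly BNC-sized corpus).
N = 100_000_000

# A bigram occurring once, built from two words occurring ten
# times each: E = f1 * f2 / N.
f, f1, f2 = 1, 10, 10
expected2 = f1 * f2 / N                 # 1e-6
mi2 = math.log2(f / expected2)
print("hapax bigram MI  = %.1f" % mi2)  # ~19.9

# For a trigram under full independence, E = f1 * f2 * f3 / N^2,
# i.e. another factor of f3 / N smaller.
f3 = 10
expected3 = f1 * f2 * f3 / N**2         # 1e-13
mi3 = math.log2(f / expected3)
print("hapax trigram MI = %.1f" % mi3)  # ~43.2
```

So a single occurrence of a trigram of modestly rare words already
produces an MI score far beyond anything a frequent bigram can
reach.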
Best to all of you,
Stefan
[ stefan.evert at uos.de | http://purl.org/stefan.evert ]
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora