[Corpora-List] Surprisingly large MI scores
Stefan Evert
stefan.evert at uos.de
Sun Sep 13 20:51:01 UTC 2009
Hi Linas,
thanks for bringing this topic up again. :-)
> I have a non-refereed blog post on this that includes graphs of the
> distribution of 2-grams in parsed text. The median is somewhere
> around MI=4 or 5 or so. Previous discussion on this mailing list
> concluded that much/most of the shape of these graphs can be
> obtained simply by picking out random pairs of words from a
> Zipf distribution. I have not carefully, formally verified this last
> hypothesis, but it seemed correct when last discussed.
In my own (informal and unsystematic) simulation experiments I
observed a much more skewed distribution, with few MI scores below 0,
but a considerable proportion of high scores (MI >= 10); I sent a plot
of this pattern to the list back then. The skewness and presence of
high scores is tied to the inclusion of low-frequency bigrams in the
data. If only bigrams with co-occurrence frequency > 5 are
considered, my highest MI scores were <= 4, and the distribution
looked much more symmetric.
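A minimal sketch of the kind of simulation I mean (all parameters
arbitrary, not the ones I actually used): draw word pairs
independently from a Zipfian unigram distribution, so any
co-occurrence is purely due to chance, and compute pointwise MI
for each pair type. The hapax pairs produce the extreme scores,
while filtering on f > 5 removes most of them:

```python
import math
import random
from collections import Counter

# Hypothetical parameters, chosen only for illustration.
V = 10_000     # vocabulary size (assumption)
ALPHA = 1.0    # Zipf exponent (assumption)
N = 200_000    # number of simulated bigram tokens (assumption)

random.seed(42)

# Zipfian unigram weights: p(rank r) proportional to 1 / r^ALPHA.
weights = [1.0 / (r ** ALPHA) for r in range(1, V + 1)]

# Draw the two slots independently -> co-occurrence by chance only.
w1 = random.choices(range(V), weights=weights, k=N)
w2 = random.choices(range(V), weights=weights, k=N)
pairs = Counter(zip(w1, w2))
f1 = Counter(w1)   # marginal frequency in first slot
f2 = Counter(w2)   # marginal frequency in second slot

def mi(pair, f):
    """Pointwise MI in bits: log2(O / E) with E = f1 * f2 / N."""
    a, b = pair
    expected = f1[a] * f2[b] / N
    return math.log2(f / expected)

scores = [mi(p, f) for p, f in pairs.items()]
scores_freq = [mi(p, f) for p, f in pairs.items() if f > 5]

# Low-frequency pairs dominate and push the distribution towards
# high MI; the f > 5 subset stays much more moderate.
print("all pairs:  max MI = %.1f" % max(scores))
print("f > 5 only: max MI = %.1f" % max(scores_freq))
```

Note that a hapax pair of two hapax words reaches MI = log2(N)
here, which is where the long right tail comes from.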
> Also note the difference between the first and second graph:
> the second shows words collocated next to each other, and
> has a median MI of 1 or 2, whereas the first graph looks at
> 2-grams which possibly have intervening words in between
> the two words. (e.g. determiners a, the; or modifiers and
> qualifiers e.g. some, many, green blue, etc.) This case is
> more interesting for things like minimal spanning-tree
> dependency parse techniques (i.e. finding a parse tree by
> maximizing the MI of linkages between word pairs in a
> sentence).
I did a similar experiment on the BNC, using pairs of nouns that co-
occur in the same sentence. The resulting distribution is markedly
different both from Linas' data and my simulation experiments (based
on chance co-occurrence and a Zipfian distribution):
[Attachment: MI_noun_noun_sentences_BNC.png (distribution of MI scores for noun-noun pairs co-occurring in the same BNC sentence): <http://listserv.linguistlist.org/pipermail/corpora/attachments/20090913/e20e8e64/attachment.png>]
As you can see, this distribution does not show the same spike at the
mode as the other distributions, and it's very strongly skewed towards
high MI scores. It is also remarkable that there is much less of a
frequency bias: the distribution for f > 5 is qualitatively similar
to the distribution for hapaxes (f = 1).
This experiment also shows that -- depending on how candidate data
were obtained -- large MI scores up to MI = 20 are not at all
uncommon. In this case, the "lenient" definition of co-occurrence
(within the same sentence) enables many word pairs to achieve quite
high co-occurrence frequency. If MI scores are calculated for larger
n-grams, the expected frequency can easily become so low that very
high MI scores are obtained even for n-grams that occur just once or
twice.
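The arithmetic behind this is easy to see with toy numbers (all of
them hypothetical): for an n-gram, the expected frequency under
independence divides by the corpus size once more for each
additional word, so even a single occurrence yields an enormous
observed/expected ratio.

```python
import math

# Toy numbers for illustration only (assumption: a 100M-token,
# roughly BNC-sized corpus).
N = 100_000_000

# A bigram occurring once, built from two words occurring ten
# times each: E = f1 * f2 / N.
f, f1, f2 = 1, 10, 10
expected2 = f1 * f2 / N                 # 1e-6
mi2 = math.log2(f / expected2)
print("hapax bigram MI  = %.1f" % mi2)  # ~19.9

# For a trigram under full independence, E = f1 * f2 * f3 / N^2,
# i.e. another factor of f3 / N smaller.
f3 = 10
expected3 = f1 * f2 * f3 / N**2         # 1e-13
mi3 = math.log2(f / expected3)
print("hapax trigram MI = %.1f" % mi3)  # ~43.2
```

So a single occurrence of a trigram of modestly rare words already
produces an MI score far beyond anything a frequent bigram can
reach.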
Best to all of you,
Stefan
[ stefan.evert at uos.de | http://purl.org/stefan.evert ]
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora