[Corpora-List] Surprisingly large MI scores

Linas Vepstas linasvepstas at gmail.com
Tue Sep 8 18:32:04 UTC 2009


Hi Brett,

2009/9/5 Brett Reynolds <brett at forsyths.ca>:
> "Formulaic language in native speakers: Triangulating psycholinguistics,
> corpus linguistics, and education" by Nick C. Ellis and Rita Simpson-Vlach
> was recently published in _Corpus Linguistics and Linguistic Theory_
> <http://www.reference-global.com/doi/abs/10.1515/CLLT.2009.003>.
>
> Therein is a table of n-grams in three columns: low, medium, and high MI
> scores. I'm going from memory, but the authors consider roughly MI=3 as low,
> MI=6 as medium, and MI=12 as high.
>
> I have only a very rudimentary understanding of MI scores, but my
> understanding is that an MI of 3 indicates a strong collocation,

I have a non-refereed blog post on this that includes graphs of the
distribution of 2-grams in parsed text.  The median is somewhere
around MI=4 or 5 or so.  Previous discussion on this mailing list
concluded that much/most of the shape of these graphs can be
obtained simply by picking out random pairs of words from a
Zipf distribution.  I have not carefully, formally verified this last
hypothesis, but it seemed correct when last discussed.

See

http://opencog.wordpress.com/2009/03/11/distribution-of-mutual-information/
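
The random-pair hypothesis mentioned above is easy to try out as a toy
experiment. The following is only a sketch under stated assumptions, not
the procedure used for the graphs in the post: the vocabulary size, the
sample size, the Zipf exponent of 1, and every function name are made up
for illustration. Both words of each pair are drawn independently from
the same Zipf distribution, and each distinct observed pair is scored
with MI(x,y) = log2( p(x,y) / (p(x) p(y)) ), using relative frequencies
from the sample itself.

```python
import math
import random
import statistics
from collections import Counter


def zipf_weights(vocab_size, exponent=1.0):
    """Unnormalised Zipf weights: the word of rank r gets weight 1 / r**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]


def random_pair_mi(vocab_size=2000, num_pairs=50000, seed=0):
    """Draw independent word pairs from a Zipf distribution and return
    (pair_counts, mi), where mi maps each distinct observed pair to its
    empirical pointwise MI in bits.

    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    words = list(range(vocab_size))
    weights = zipf_weights(vocab_size)

    # Left and right words are sampled independently of each other.
    left = rng.choices(words, weights=weights, k=num_pairs)
    right = rng.choices(words, weights=weights, k=num_pairs)

    pair_counts = Counter(zip(left, right))
    left_counts = Counter(left)
    right_counts = Counter(right)

    mi = {}
    for (x, y), n_xy in pair_counts.items():
        # p(x,y) / (p(x) p(y)) written in terms of raw counts.
        ratio = n_xy * num_pairs / (left_counts[x] * right_counts[y])
        mi[(x, y)] = math.log2(ratio)
    return pair_counts, mi


if __name__ == "__main__":
    counts, mi = random_pair_mi()
    values = sorted(mi.values())
    print("distinct pairs:", len(values))
    print("median MI:", statistics.median(values))
    print("max MI:", values[-1])
```

Because the two words are independent by construction, any nonzero MI
here comes purely from finite sampling: a pair of rare words that
happens to be seen once gets a large score. Histogramming the values
gives something to compare against the shape of the graphs in the post.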

Also note the difference between the first and second graph in that
post: the second shows words that are immediately adjacent to each
other, and has a median MI of 1 or 2, whereas the first graph looks at
2-grams that may have intervening words between the two words (e.g.
determiners such as a, the; or modifiers and qualifiers such as some,
many, green, blue, etc.).  This case is more interesting for things
like minimal spanning-tree dependency parse techniques (i.e. finding
a parse tree by maximizing the MI of linkages between word pairs in a
sentence).
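
To make the spanning-tree idea concrete, here is a minimal sketch of
picking the set of links that maximizes total MI over one sentence. It
is a plain maximum-weight spanning tree (Prim's algorithm) over word
positions, which is the same thing as a minimum spanning tree on negated
MI. The function name, the MI table, and the penalty for unseen pairs
are all hypothetical, and the sketch ignores constraints a real parser
would likely want, such as forbidding crossing links.

```python
def max_mi_spanning_tree(words, pair_mi, unseen_mi=-10.0):
    """Return links (i, j), i < j, between word positions that form a
    spanning tree of maximal total MI.

    pair_mi maps (left_word, right_word), in sentence order, to an MI
    score. unseen_mi is an assumed score for pairs missing from the table.
    """
    n = len(words)
    if n < 2:
        return []

    def score(i, j):
        # Look the pair up in sentence order, left word first.
        a, b = (i, j) if i < j else (j, i)
        return pair_mi.get((words[a], words[b]), unseen_mi)

    in_tree = {0}  # Prim's algorithm: grow the tree from the first word.
    links = []
    while len(in_tree) < n:
        # Best-scoring link joining a word in the tree to one outside it.
        _, i, j = max(
            ((score(i, j), i, j)
             for i in in_tree
             for j in range(n) if j not in in_tree),
            key=lambda candidate: candidate[0],
        )
        links.append((min(i, j), max(i, j)))
        in_tree.add(j)
    return sorted(links)


if __name__ == "__main__":
    # Toy MI values, invented for illustration only.
    toy_mi = {
        ("the", "green"): 1.0, ("the", "ball"): 3.0, ("the", "rolled"): 0.2,
        ("green", "ball"): 6.0, ("green", "rolled"): 0.5,
        ("ball", "rolled"): 5.0,
    }
    sentence = ["the", "green", "ball", "rolled"]
    for i, j in max_mi_spanning_tree(sentence, toy_mi):
        print(sentence[i], "--", sentence[j])
```

Note that the link between "the" and "ball" skips over "green", which is
exactly the kind of non-adjacent pair counted in the first graph.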

At the rate I'm going, I probably will not get around to writing
a formal article on this for any refereed journal. I'm a stickler for
details, and doing things "the right way" can take a lot of time.
:-)

--linas

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


