[Corpora-List] Surprisingly large MI scores
Linas Vepstas
linasvepstas at gmail.com
Tue Sep 8 18:32:04 UTC 2009
Hi Brett,
2009/9/5 Brett Reynolds <brett at forsyths.ca>:
> "Formulaic language in native speakers: Triangulating psycholinguistics,
> corpus linguistics, and education" by Nick C. Ellis and Rita Simpson-Vlach
> was recently published in _Corpus Linguistics and Linguistic Theory_
> <http://www.reference-global.com/doi/abs/10.1515/CLLT.2009.003>.
>
> Therein is a table of n-grams in three columns: low, medium, and high MI
> scores. I'm going from memory, but the authors consider roughly MI=3 as low,
> MI=6 as medium, and MI=12 as high.
>
> I have only a very rudimentary understanding of MI scores, but my
> understanding is that an MI of 3 indicates a strong collocation,
I have a non-refereed blog post on this that includes graphs of the
distribution of 2-grams in parsed text. The median is somewhere
around MI=4 or 5 or so. Previous discussion on this mailing list
concluded that much/most of the shape of these graphs can be
obtained simply by picking out random pairs of words from a
Zipf distribution. I have not carefully, formally verified this last
hypothesis, but it seemed correct when last discussed.
See
http://opencog.wordpress.com/2009/03/11/distribution-of-mutual-information/
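The Zipf hypothesis mentioned above can be tried out with a small
simulation. What follows is only a rough sketch, not the procedure
used for the graphs in the blog post: the vocabulary size, the Zipf
exponent of 1, the number of pairs and all function names are
assumptions made for illustration. Both words of each pair are drawn
independently, so any non-zero MI in the output comes from sampling
noise alone.

```python
import math
import random
import statistics
from collections import Counter


def zipf_weights(vocab_size, exponent=1.0):
    # Unnormalized Zipf weights: the word of rank r gets weight 1 / r**exponent.
    return [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]


def sample_pairs(vocab_size, n_pairs, seed=0):
    # Draw the left and right word of every pair independently.
    rng = random.Random(seed)
    weights = zipf_weights(vocab_size)
    words = range(vocab_size)
    left = rng.choices(words, weights=weights, k=n_pairs)
    right = rng.choices(words, weights=weights, k=n_pairs)
    return list(zip(left, right))


def pointwise_mi(pairs):
    # MI(x, y) = log2( p(x, y) / (p(x) p(y)) ), estimated from raw counts.
    n = len(pairs)
    pair_counts = Counter(pairs)
    left_counts = Counter(l for l, _ in pairs)
    right_counts = Counter(r for _, r in pairs)
    return {
        (l, r): math.log2(c * n / (left_counts[l] * right_counts[r]))
        for (l, r), c in pair_counts.items()
    }


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to keep the run short.
    mi = pointwise_mi(sample_pairs(vocab_size=5000, n_pairs=100000))
    print("distinct pairs:", len(mi))
    print("median MI over distinct pairs:", statistics.median(mi.values()))
```

Binning mi.values() into a histogram gives a curve that can be set
beside the graphs in the blog post. Whether the shapes actually match
is exactly the part that has not been carefully verified.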
Also note the difference between the first and second graphs:
the second shows words that are immediately adjacent, and
has a median MI of 1 or 2, whereas the first graph looks at
2-grams that may have intervening words between the two
words (e.g. determiners such as a, the; or modifiers and
qualifiers such as some, many, green, blue, etc.). This latter
case is more interesting for things like minimal spanning-tree
dependency parse techniques (i.e. finding a parse tree by
maximizing the MI of linkages between word pairs in a
sentence).
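To make the spanning-tree idea concrete, here is a minimal sketch
that picks the set of word-pair links with the largest total MI,
using Kruskal's algorithm with the edges sorted from highest to
lowest score (the same thing as a minimum spanning tree over negated
scores). The sentence, the MI values and the function name are all
made up for illustration, and the sketch ignores constraints a real
parser would likely want, such as forbidding crossing links.

```python
def max_spanning_tree(n_words, scores):
    # scores maps a pair of word positions (i, j) to an MI value.
    # Returns the links (i, j, mi) of a maximum-weight spanning tree.
    parent = list(range(n_words))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    # Try the highest-MI links first; keep a link only if it joins
    # two groups of words that are not yet connected.
    for (i, j), mi in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            tree.append((i, j, mi))
    return tree


# Hypothetical sentence and hypothetical MI scores, not measured data.
sentence = ["the", "green", "ball", "rolled"]
scores = {
    (0, 1): 0.5, (0, 2): 3.0, (0, 3): 0.2,
    (1, 2): 4.0, (1, 3): 0.1, (2, 3): 5.0,
}
tree = max_spanning_tree(len(sentence), scores)
for i, j, mi in tree:
    print(sentence[i], "--", sentence[j], mi)
```

With these invented scores every link attaches to "ball", including
the link from "the", which skips over the intervening "green". That
is the reason non-adjacent pairs matter for this kind of parse.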
At the rate I'm going, I will probably not get around to writing
a formal article on this for any refereed journal. I'm a stickler for
details, and doing things "the right way" can take a lot of time.
:-)
--linas
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora