[Corpora-List] MI for more than 2 items

"Thomas François" Thomas.Francois at uclouvain.be
Wed May 15 22:04:32 UTC 2013


Hello,

We ran into a similar issue previously with one of my colleagues. Most of
the papers only discuss association measures for two units, but some ways
of taking longer lexical units into account have been suggested in the
literature (see also the short sketch after the references below).

You can check:
- V. Seretan, L. Nerima, and E. Wehrli. 2003. Extraction of Multi-Word
Collocations Using Syntactic Bigram Composition.
- J. da Silva, G. Dias, S. Guilloré, and J. Pereira Lopes. 1999. Using
LocalMaxs Algorithm for the Extraction of Contiguous and Non-contiguous
Multiword Lexical Units.
- and our paper for a short discussion (section 2.2.1) and some links:
Watrin, P. and François, T. 2011. An N-gram Frequency Database Reference
to Handle MWE Extraction in NLP Applications.
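
For a concrete starting point, one straightforward n-ary extension of PMI
compares the probability of the whole n-gram with the product of the
unigram probabilities. Here is a minimal Python sketch of that extension;
the counts are invented for illustration, and this is not the specific
method of any of the papers above:

import math

def ngram_pmi(ngram_count, unigram_counts, total_ngrams, total_tokens):
    # log2( p(w1..wn) / (p(w1) * ... * p(wn)) ):
    # how much more often the n-gram occurs than independence would predict.
    p_ngram = ngram_count / total_ngrams
    p_indep = 1.0
    for count in unigram_counts:
        p_indep *= count / total_tokens
    return math.log2(p_ngram / p_indep)

# Invented counts for a 4-word bundle in a 1M-token corpus.
print(round(ngram_pmi(50, [20000, 500, 1200, 30000], 997000, 1000000), 2))

Like ordinary PMI, this measure inflates the scores of rare n-grams, so in
practice it is usually combined with a frequency threshold.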

I hope this helps,

Best,

Thomas François

> The mutual information score that lexicographers use is a close relative
> of the mathematical notion of mutual information between two random
> variables. Peter Turney and others have been careful to reflect this
> distinction by using the term 'pointwise mutual information' (PMI) for
> the lexicographer's version and MI for the other. Technically, MI is the
> probability-weighted sum, over all cells of a two-dimensional contingency
> table, of the PMI. This means that you can begin to think of PMI as
> something like "the contribution of a particular pair of words to MI",
> and lexicographers have had fair success interpreting it this way.
> Mathematicians tend to look askance at PMI, because of concerns like "the
> PMI for a pair of words can in principle be negative even when the MI
> summed over all words is positive. What (the hell) does that mean?"
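>
> A minimal numeric check of that relationship, with toy contingency counts
> rather than real corpus data:
>
> import math
>
> # Toy 2x2 contingency counts: word x present/absent vs word y present/absent.
> joint = {("x", "y"): 10, ("x", "~y"): 40, ("~x", "y"): 40, ("~x", "~y"): 910}
> total = sum(joint.values())
>
> def marginal(value):
>     # probability that one side of the pair takes the given value
>     return sum(c for cell, c in joint.items() if value in cell) / total
>
> mi = 0.0
> for (a, b), count in joint.items():
>     p_ab = count / total
>     pmi = math.log2(p_ab / (marginal(a) * marginal(b)))
>     mi += p_ab * pmi          # MI is the p-weighted sum of the cell PMIs
>     print(a, b, round(pmi, 3))
> print("MI =", round(mi, 4))   # non-negative, though two cell PMIs are < 0
>
> Here PMI(x, y) is 2.0 bits, the two mixed cells have negative PMI, and
> the weighted sum still comes out positive, as MI must.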
>
> MI is a central notion of information theory and is backed by many useful
> mathematical results. For the task of measuring word association, though,
> the mathematical advantages of MI do not really translate into a
> preference for using PMI rather than some other measure of association.
> If it works for you, that's OK; you just don't get much extra from the
> connection to the mathematics.
>
> Once you move to three or more terms, things get even more complex. The
> generalizations of MI to three or more terms are confusing in themselves,
> just because interactions between three or more variables are much more
> complicated than interactions between just two. The generalizations of PMI
> would be at least as messy, possibly worse, so it is no surprise that
> mathematical support for such generalizations is missing.
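>
> As one example of the mess, take the three-way 'interaction information'
> (one common generalization of MI, whose sign convention already varies
> between authors). For an XOR-style distribution it comes out negative,
> even though each pair of variables is independent. A small sketch:
>
> import math
>
> # Z = X XOR Y, with X and Y independent fair coins.
> p = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}
>
> def H(positions):
>     # entropy of the marginal distribution over the given variable positions
>     marg = {}
>     for cell, prob in p.items():
>         key = tuple(cell[i] for i in positions)
>         marg[key] = marg.get(key, 0.0) + prob
>     return -sum(q * math.log2(q) for q in marg.values() if q > 0)
>
> # I(X;Y;Z) = H(X) + H(Y) + H(Z) - H(X,Y) - H(X,Z) - H(Y,Z) + H(X,Y,Z)
> i3 = (H((0,)) + H((1,)) + H((2,))
>       - H((0, 1)) - H((0, 2)) - H((1, 2))
>       + H((0, 1, 2)))
> print(i3)  # -1.0 bit
>
> Every pairwise MI here is exactly 0, yet the three variables are jointly
> completely dependent: whatever the three-way score measures, it is not a
> simple extension of two-way association.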
>
> On Tue, May 14, 2013 at 10:14 AM, Mike Scott <mike at lexically.net> wrote:
>
>>  I have had a query about MI (or any other similar statistic) involving
>> more than two elements:
>>
>> "I don't know how to calculate the Mutual Information (MI) for these
>> 4-word lexical bundles, it seems I can only find the MI score for 2-word
>> collocations."
>>
>> Can anyone advise please?
>>
>> Cheers -- Mike
>>
>> --
>> Mike Scott
>>
>> ***
>> If you publish research which uses WordSmith, do let me know so I can
>> include it at
>> http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm
>> ***
>> University of Aston and Lexical Analysis Software Ltd.
>> mike.scott at aston.ac.uk
>> www.lexically.net
>>
>
>
> --
> Chris Brew


