[Corpora-List] MI for more than 2 items

Ted Pedersen tpederse at d.umn.edu
Wed May 15 22:20:33 UTC 2013


The Ngram Statistics Package (NSP) includes support for 3-grams for
mutual information, pointwise mutual information, the log-likelihood
ratio, and the Poisson-Stirling measure. It also includes 4-gram
support for the log-likelihood ratio. It does this in a relatively
straightforward way: the model used to estimate expected values
(which are then compared to the observed data) is based on a hypothesis
of complete independence between all the words in the ngram. The
approach is simple, but I would say that it has some mathematical validity.
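
To make the independence model concrete, here is a minimal Python sketch
of trigram PMI and the log-likelihood ratio (G2) computed against
expected values from complete independence. It is not NSP's code (NSP is
written in Perl); the function name trigram_scores and the 2x2x2 table
layout are assumptions made for illustration only.

```python
import math
from collections import Counter
from itertools import product


def trigram_scores(tokens, trigram):
    """Return (PMI, G2) for one trigram under complete independence.

    Hypothetical helper for illustration; not part of NSP.
    """
    trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
    n = len(trigrams)
    w1, w2, w3 = trigram

    # Observed 2x2x2 table: each cell records whether slot i holds the
    # target word (1) or any other word (0).
    observed = Counter(
        (int(a == w1), int(b == w2), int(c == w3)) for a, b, c in trigrams
    )

    # Marginal probability of the target word in each slot.
    p = [
        sum(cnt for cell, cnt in observed.items() if cell[i] == 1) / n
        for i in range(3)
    ]

    def expected(cell):
        # Independence hypothesis: multiply the three slot marginals.
        e = float(n)
        for i, bit in enumerate(cell):
            e *= p[i] if bit else 1.0 - p[i]
        return e

    o111 = observed[(1, 1, 1)]
    pmi = math.log2(o111 / expected((1, 1, 1))) if o111 else float("-inf")

    # G2 = 2 * sum over cells of O * ln(O / E); empty cells contribute 0.
    g2 = 0.0
    for cell in product((0, 1), repeat=3):
        o = observed[cell]
        if o > 0:
            g2 += 2.0 * o * math.log(o / expected(cell))
    return pmi, g2
```

For example, trigram_scores(tokens, ("new", "york", "city")) on a
tokenized corpus returns the two scores for that one trigram. PMI looks
only at the cell where all three words co-occur, while G2 sums over all
eight cells of the table.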

Whether or not it provides useful results is of course the real test.
In truth, as you extend these measures beyond bigrams, you usually end
up with scores (in the case of log-likelihood at least) that almost
always translate into statistical significance, since a longer ngram
really has no chance of simply occurring by chance (and indeed the
same is true of shorter ngrams, but at least statistically they can
appear to be chance events). So, I would urge caution in trying to
assign statistical significance to these values, but I do think they
can be useful when used for relative rankings.
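
As a rough illustration of using such scores only for relative ranking,
the sketch below orders the trigrams of a token list by PMI under the
same independence assumption. The function name rank_trigrams_by_pmi and
the min_count cutoff are assumptions for this example, not NSP options.

```python
import math
from collections import Counter


def rank_trigrams_by_pmi(tokens, min_count=2):
    """Rank trigrams by PMI under independence; scores are for ordering only.

    Hypothetical helper for illustration; not part of NSP.
    """
    trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
    n = len(trigrams)
    joint = Counter(trigrams)
    # Slot marginals: how often each word fills slot 1, 2, or 3.
    slots = [Counter(t[i] for t in trigrams) for i in range(3)]

    ranked = []
    for tri, count in joint.items():
        if count < min_count:
            continue  # skip rare trigrams, whose PMI tends to be inflated
        expected = float(n)
        for i, word in enumerate(tri):
            expected *= slots[i][word] / n
        ranked.append((tri, math.log2(count / expected)))

    # Highest score first; no significance threshold is applied.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

The output is a list of (trigram, score) pairs, and only the order is
meant to be read, not the magnitude of any single score.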

More about NSP can be found here: http://ngram.sourceforge.net

Cordially,
Ted

On Wed, May 15, 2013 at 2:57 PM, Lushan Han <lushan1 at umbc.edu> wrote:
> Then, is there any solid, mathematically-based association measure between
> three or more variables?
>
> Thanks,
>
> Lushan Han
>
>
> On Wed, May 15, 2013 at 2:53 PM, Chris Brew <christopher.brew at gmail.com>
> wrote:
>>
>> The mutual information score that lexicographers use is a close relative
>> of the mathematical notion of mutual information between two random
>> variables. Peter Turney and others have been careful to reflect this
>> distinction by using the term 'pointwise mutual information' (PMI)  for the
>> lexicographer's version and MI for the other. Technically, MI is the sum
>> over all cells of a two-dimensional matrix of the PMI, with each cell
>> weighted by its joint probability. This means that you
>> can begin to think of PMI as something like "the contribution of a
>> particular pair of words to MI". And lexicographers have had fair success
>> interpreting it this way. The mathematicians tend to look askance at PMI,
>> because of concerns like "the PMI for a pair of words can in principle be
>> negative even when the MI summed over all words is positive. What (the hell)
>> does that mean?"
>>
>> MI is a central notion of information theory, and backed by many useful
>> mathematical results. For the task of measuring word association, the
>> mathematical advantages of MI do not really translate into a preference
>> for using PMI rather than
>> some other measure of association. If it works for you, that's OK. You don't
>> get much extra from the connection to the mathematics.
>>
>> Once you move to three or more terms, things get even more complex. The
>> generalizations of MI to three or more terms are confusing in themselves,
>> just because interactions between three or more variables are much more
>> complicated than interactions between just two. The generalizations of PMI
>> would be at least as messy, possibly worse, so it is no surprise that
>> mathematical support for such generalizations is missing.
>>
>>
>>
>>
>>
>> On Tue, May 14, 2013 at 10:14 AM, Mike Scott <mike at lexically.net> wrote:
>>>
>>> I have had a query about MI (or any other similar statistic) involving
>>> more than two elements:
>>>
>>> "I don't know how to calculate the Mutual Information (MI) for these
>>> 4-word lexical bundles, it seems I can only find the MI score for 2-word
>>> collocations."
>>>
>>> Can anyone advise please?
>>>
>>> Cheers -- Mike
>>>
>>> --
>>> Mike Scott
>>>
>>> ***
>>> If you publish research which uses WordSmith, do let me know so I can
>>> include it at
>>>
>>> http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm
>>> ***
>>> University of Aston and Lexical Analysis Software Ltd.
>>> mike.scott at aston.ac.uk
>>> www.lexically.net
>>>
>>>
>>> _______________________________________________
>>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>> Corpora mailing list
>>> Corpora at uib.no
>>> http://mailman.uib.no/listinfo/corpora
>>>
>>
>>
>>
>> --
>> Chris Brew
>>
>
>



