[Corpora-List] MI for more than 2 items

Chris Brew christopher.brew at gmail.com
Wed May 15 22:34:43 UTC 2013


Yes. The hypothesis of complete independence is a thoroughly well-defined
thing to test against, and the only real
issue is whether one can find anything interesting by examining the values
that result.
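
For concreteness, a minimal sketch (Python; the function name is just for
illustration) of the pointwise statistic for a 3-gram, scored against the
complete-independence model P(w1, w2, w3) = P(w1) P(w2) P(w3):

    from math import log2

    def pmi3(n123, n1, n2, n3, N):
        """Pointwise association of a trigram against complete independence.

        n123       : count of the trigram (w1, w2, w3)
        n1, n2, n3 : count of each word in its position
        N          : total number of trigram windows in the corpus
        """
        p123 = n123 / N
        p1, p2, p3 = n1 / N, n2 / N, n3 / N
        return log2(p123 / (p1 * p2 * p3))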


On Wed, May 15, 2013 at 3:20 PM, Ted Pedersen <tpederse at d.umn.edu> wrote:

> The Ngram Statistics Package includes support for 3-grams for
> mutual information, pointwise mutual information, the log-likelihood
> ratio, and the Poisson-Stirling measure. It also includes 4-gram
> support for the log-likelihood ratio. It does this in a relatively
> straightforward way, where the model you estimate expected values for
> (and compare to the observed data) is based on a hypothesis of
> complete independence between all the words in the ngram. But, while
> simple, I would say that it has some mathematical validity.
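
As a rough sketch of that flavor of computation (in Python rather than
NSP's own Perl, and not claiming to match NSP's exact conventions): the
expected count for every cell of the 2x2x2 contingency table comes from the
complete-independence model, and the log-likelihood ratio compares observed
with expected counts.

    from math import log

    def loglikelihood_3gram(n123, n12, n13, n23, n1, n2, n3, N):
        """G^2 for a trigram (w1, w2, w3) against complete independence.

        n123          : count of the trigram
        n12, n13, n23 : counts of the word pairs in their positions
        n1, n2, n3    : counts of each word in its position
        N             : total number of trigram windows
        """
        # Observed counts for the 8 cells (word present / absent at each
        # position), reconstructed from the counts by inclusion-exclusion.
        obs = {
            (1, 1, 1): n123,
            (1, 1, 0): n12 - n123,
            (1, 0, 1): n13 - n123,
            (0, 1, 1): n23 - n123,
            (1, 0, 0): n1 - n12 - n13 + n123,
            (0, 1, 0): n2 - n12 - n23 + n123,
            (0, 0, 1): n3 - n13 - n23 + n123,
            (0, 0, 0): N - n1 - n2 - n3 + n12 + n13 + n23 - n123,
        }
        p1, p2, p3 = n1 / N, n2 / N, n3 / N
        g2 = 0.0
        for (i, j, k), o in obs.items():
            # Expected count under P(w1) P(w2) P(w3).
            e = (N * (p1 if i else 1 - p1)
                   * (p2 if j else 1 - p2)
                   * (p3 if k else 1 - p3))
            if o > 0:
                g2 += o * log(o / e)
        return 2.0 * g2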
>
> Whether or not it provides useful results is of course the real test.
> In truth, as you extend these measures beyond bigrams you usually end
> up with scores (in the case of the log-likelihood ratio at least) that
> almost always translate into statistical significance, since a longer
> ngram really has no chance of simply occurring by chance (and indeed
> the same is true of shorter ngrams, but at least statistically they
> can appear to be chance events). So, I would urge caution in trying to
> assign statistical significance to these values, but I do think they
> can be useful for relative rankings.
>
> More about NSP can be found here: http://ngram.sourceforge.net
>
> Cordially,
> Ted
>
> On Wed, May 15, 2013 at 2:57 PM, Lushan Han <lushan1 at umbc.edu> wrote:
> > Then, is there any solid, mathematically based association measure
> > between three or more variables?
> >
> > Thanks,
> >
> > Lushan Han
> >
> >
> > On Wed, May 15, 2013 at 2:53 PM, Chris Brew <christopher.brew at gmail.com>
> > wrote:
> >>
> >> The mutual information score that lexicographers use is a close relative
> >> of the mathematical notion of mutual information between two random
> >> variables. Peter Turney and others have been careful to reflect this
> >> distinction by using the term 'pointwise mutual information' (PMI) for
> >> the lexicographer's version and MI for the other. Technically, MI is the
> >> probability-weighted sum, over all cells of a two-dimensional
> >> contingency table, of the PMI. This means that you can begin to think of
> >> PMI as something like "the contribution of a particular pair of words to
> >> MI". And lexicographers have had fair success interpreting it this way.
> >> The mathematicians tend to look askance at PMI, because of concerns like
> >> "the PMI for a pair of words can in principle be negative even when the
> >> MI summed over all words is positive. What (the hell) does that mean?"
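
A small illustration of that relationship, with a made-up joint distribution
(the word pairs are only for the example): MI is the expectation of PMI
under the joint distribution, so individual cells can have negative PMI
while the total stays non-negative.

    from math import log2

    def pmi(p_xy, p_x, p_y):
        return log2(p_xy / (p_x * p_y))

    def mutual_information(joint):
        """MI of a joint distribution given as {(x, y): probability}."""
        px, py = {}, {}
        for (x, y), p in joint.items():
            px[x] = px.get(x, 0.0) + p
            py[y] = py.get(y, 0.0) + p
        return sum(p * pmi(p, px[x], py[y])
                   for (x, y), p in joint.items() if p > 0)

    joint = {('strong', 'tea'): 0.4, ('strong', 'computer'): 0.1,
             ('powerful', 'tea'): 0.1, ('powerful', 'computer'): 0.4}
    # PMI for ('strong', 'computer') is about -1.32, yet
    # mutual_information(joint) is about +0.28.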
> >>
> >> MI is a central notion of information theory, and it is backed by many
> >> useful mathematical results. For the task of measuring word association,
> >> however, the mathematical advantages of MI do not really translate into
> >> a preference for using PMI rather than some other measure of
> >> association. If it works for you, that's OK. You don't get much extra
> >> from the connection to the mathematics.
> >>
> >> Once you move to three or more terms, things get even more complex. The
> >> generalizations of MI to three or more terms are confusing in
> >> themselves, just because interactions between three or more variables
> >> are much more complicated than interactions between just two. The
> >> generalizations of PMI would be at least as messy, possibly worse, so it
> >> is no surprise that mathematical support for such generalizations is
> >> missing.
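
For what it is worth, one of the several generalizations in question is
total correlation (also called multi-information): the KL divergence of the
joint distribution from the product of its marginals, i.e. the comparison
with complete independence again. A minimal sketch for three variables,
with the joint given as a dict of probabilities (the function name is just
for the sketch):

    from math import log2

    def total_correlation(joint):
        """Sum over (x, y, z) of p(x,y,z) * log2(p(x,y,z) / (p(x) p(y) p(z)))
        for a joint distribution given as {(x, y, z): probability}."""
        px, py, pz = {}, {}, {}
        for (x, y, z), p in joint.items():
            px[x] = px.get(x, 0.0) + p
            py[y] = py.get(y, 0.0) + p
            pz[z] = pz.get(z, 0.0) + p
        return sum(p * log2(p / (px[x] * py[y] * pz[z]))
                   for (x, y, z), p in joint.items() if p > 0)

Its pointwise version, log2(p(x,y,z) / (p(x) p(y) p(z))), is the 3-gram
analogue of PMI and inherits the same interpretive worries discussed above.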
> >>
> >>
> >>
> >>
> >>
> >> On Tue, May 14, 2013 at 10:14 AM, Mike Scott <mike at lexically.net> wrote:
> >>>
> >>> I have had a query about MI (or any other similar statistic) involving
> >>> more than two elements:
> >>>
> >>> "I don't know how to calculate the Mutual Information (MI) for these
> >>> 4-word lexical bundles, it seems I can only find the MI score for
> 2-word
> >>> collocations."
> >>>
> >>> Can anyone advise please?
> >>>
> >>> Cheers -- Mike
> >>>
> >>> --
> >>> Mike Scott
> >>>
> >>> ***
> >>> If you publish research which uses WordSmith, do let me know so I can
> >>> include it at
> >>> http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm
> >>> ***
> >>> University of Aston and Lexical Analysis Software Ltd.
> >>> mike.scott at aston.ac.uk
> >>> www.lexically.net
> >>>
> >>>
> >>
> >>
> >>
> >> --
> >> Chris Brew
> >>
> >
> >
>



-- 
Chris Brew