<div dir="ltr">Yes. The hypothesis of complete independence is a thoroughly well-defined thing to test against, and the only real <div><div>issue is whether one can find anything interesting by examining the values that result. </div>

<div><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, May 15, 2013 at 3:20 PM, Ted Pedersen <span dir="ltr"><<a href="mailto:tpederse@d.umn.edu" target="_blank">tpederse@d.umn.edu</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">The Ngram Statistics Package includes support for 3-grams for<br>

mutual information, pointwise mutual information, the log-likelihood<br>

ratio, and the Poisson-Stirling measure. It also includes 4-gram<br>

support for the log-likelihood ratio. It does this in a relatively<br>

straightforward way, where the model you estimate expected values for<br>

(and compare to the observed data) is based on a hypothesis of<br>

complete independence between all the words in the ngram. But, while<br>

simple, I would say that it has some mathematical validity.<br>

<br>
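<div>A minimal sketch of the complete-independence model described above, for a single trigram. The function name <code>trigram_g2</code> and the list-of-tuples input are assumptions made for illustration and are not NSP's actual interface. It builds the observed 2x2x2 table for the target trigram, derives expected counts from the three position marginals alone, and returns the log-likelihood ratio statistic.</div>
<pre>
```python
import math
from collections import Counter
from itertools import product


def trigram_g2(trigrams, target):
    """Log-likelihood ratio (G^2) of one trigram against the hypothesis
    that its three positions are completely independent.

    trigrams: non-empty list of 3-tuples of tokens (hypothetical input format)
    target:   the 3-tuple being scored
    """
    n = len(trigrams)
    # Observed 2x2x2 table. Each cell is keyed by three booleans that say
    # whether position i of a corpus trigram matches position i of the target.
    observed = Counter(
        tuple(tok == want for tok, want in zip(tri, target)) for tri in trigrams
    )
    # Marginal probability that position i holds the target's word.
    p_match = [
        sum(count for cell, count in observed.items() if cell[i]) / n
        for i in range(3)
    ]
    total = 0.0
    for cell in product((True, False), repeat=3):
        obs = observed.get(cell, 0)
        if obs == 0:
            continue  # a cell with no observations contributes nothing
        # Expected count under independence: n times the product of marginals.
        expected = n
        for i, hit in enumerate(cell):
            expected *= p_match[i] if hit else 1.0 - p_match[i]
        total += obs * math.log(obs / expected)
    return 2.0 * total
```
</pre>
<div>Scoring every candidate trigram with the same function and sorting by the result gives a relative ranking, which is the use suggested here, rather than a significance test.</div>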

Whether or not it provides useful results is of course the real test.<br>

In truth as you extend these measures beyond bigrams you usually end<br>

up with scores (in the case of log-likelihood at least) that almost<br>

always translate into statistical significance, since a longer ngram<br>

really has no chance of simply occurring by chance (and indeed the<br>

same is true of shorter ngrams, but at least statistically they can<br>

appear to be chance events). So, I would urge caution in trying to<br>

assign statistical significance to these values, but I do think they<br>

can be useful when used for relative rankings.<br>

<br>

More about NSP can be found here : <a href="http://ngram.sourceforge.net" target="_blank">http://ngram.sourceforge.net</a><br>

<br>

Cordially,<br>

Ted<br>

<div><div><br>

On Wed, May 15, 2013 at 2:57 PM, Lushan Han <<a href="mailto:lushan1@umbc.edu" target="_blank">lushan1@umbc.edu</a>> wrote:<br>

> Then, is there any solid, mathematically-based association measure between<br>

> three or more variables?<br>

><br>

> Thanks,<br>

><br>

> Lushan Han<br>

><br>

><br>

> On Wed, May 15, 2013 at 2:53 PM, Chris Brew <<a href="mailto:christopher.brew@gmail.com" target="_blank">christopher.brew@gmail.com</a>><br>

> wrote:<br>

>><br>

>> The mutual information score that lexicographers use is a close relative<br>

>> of the mathematical notion of mutual information between two random<br>

>> variables. Peter Turney and others have been careful to reflect this<br>

>> distinction by using the term 'pointwise mutual information' (PMI)  for the<br>

>> lexicographer's version and MI for the other. Technically, MI is the sum,<br>

>> weighted by joint probability, over all cells of a two-dimensional matrix of the PMI. This means that you<br>

>> can begin to think of PMI as something like "the contribution of a<br>

>> particular pair of words to MI". And lexicographers have had fair success<br>

>> interpreting it this way. The mathematicians tend to look askance at PMI,<br>

>> because of concerns like "the PMI for a pair of words can in principle be<br>

>> negative even when the MI summed over all words is positive. What (the hell)<br>

>> does that mean?"<br>
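<div>The relation between the two quantities can be sketched as follows, reading the sum as an expectation in which each cell's PMI is weighted by its joint probability. The function names and the toy joint distribution are assumptions chosen for illustration; the toy table is built so that one cell has negative PMI while the overall MI stays positive.</div>
<pre>
```python
import math


def pmi(joint, x, y):
    """Pointwise mutual information, in bits, of one cell of a joint table.

    joint: dict mapping (x, y) pairs to joint probabilities (hypothetical format)
    """
    px = sum(p for (a, _), p in joint.items() if a == x)
    py = sum(p for (_, b), p in joint.items() if b == y)
    return math.log2(joint[(x, y)] / (px * py))


def mutual_information(joint):
    """Mutual information as the expected PMI under the joint distribution."""
    return sum(p * pmi(joint, x, y) for (x, y), p in joint.items() if p > 0)


# Toy joint distribution over two binary variables (made-up numbers).
joint = {
    ("a", "c"): 0.4,
    ("a", "d"): 0.1,  # rarer than independence predicts, so its PMI is negative
    ("b", "c"): 0.1,
    ("b", "d"): 0.4,
}

negative_cell = pmi(joint, "a", "d")
overall = mutual_information(joint)
```
</pre>
<div>Here <code>negative_cell</code> is below zero while <code>overall</code> is above zero, because the cells with positive PMI carry most of the probability mass.</div>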

>><br>

>> MI is a central notion of information theory, and backed by many useful<br>

>> mathematical results. For the task of measuring word association, the<br>

>> mathematical advantages<br>

>> of MI do not really translate into a preference for using PMI rather than<br>

>> some other measure of association. If it works for you, that's OK. You don't<br>

>> get much extra from the connection to the mathematics.<br>

>><br>

>> Once you move to three or more terms, things get even more complex. The<br>

>> generalizations of MI to three or more terms are confusing in themselves,<br>

>> just because interactions between three or more variables are much more<br>

>> complicated than interactions between just two. The generalizations of PMI<br>

>> would be at least as messy, possibly worse, so it is no surprise that<br>

>> mathematical support for such generalizations is missing.<br>

>><br>

>><br>

>><br>

>><br>

>><br>

>> On Tue, May 14, 2013 at 10:14 AM, Mike Scott <<a href="mailto:mike@lexically.net" target="_blank">mike@lexically.net</a>> wrote:<br>

>>><br>

>>> I have had a query about MI (or any other similar statistic) involving<br>

>>> more than two elements:<br>

>>><br>

>>> "I don't know how to calculate the Mutual Information (MI) for these<br>

>>> 4-word lexical bundles, it seems I can only find the MI score for 2-word<br>

>>> collocations."<br>

>>><br>

>>> Can anyone advise please?<br>

>>><br>

>>> Cheers -- Mike<br>

>>><br>

>>> --<br>

>>> Mike Scott<br>

>>><br>

>>> ***<br>

>>> If you publish research which uses WordSmith, do let me know so I can<br>

>>> include it at<br>

>>><br>

>>> <a href="http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm" target="_blank">http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm</a><br>


>>> ***<br>

>>> University of Aston and Lexical Analysis Software Ltd.<br>

>>> <a href="mailto:mike.scott@aston.ac.uk" target="_blank">mike.scott@aston.ac.uk</a><br>

>>> <a href="http://www.lexically.net" target="_blank">www.lexically.net</a><br>

>>><br>

>>><br>

>>> _______________________________________________<br>

>>> UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a><br>

>>> Corpora mailing list<br>

>>> <a href="mailto:Corpora@uib.no" target="_blank">Corpora@uib.no</a><br>

>>> <a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>

>>><br>

>><br>

>><br>

>><br>

>> --<br>

>> Chris Brew<br>

>><br>


>><br>

><br>

><br>


><br>

<br>


</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div dir="ltr">Chris Brew</div>

</div></div></div></div>