[Corpora-List] Distribution of mutual information?

Linas Vepstas linasvepstas at gmail.com
Wed Mar 18 19:54:10 UTC 2009


2009/3/18 Stefan Evert <stefan.evert at uos.de>:
>>
>
> Hi Linas!
>
>> Well, I've taken a more circuitous route, but not yet arrived
>> at your results. At the moment, I've been trying to generate
>> random texts. It appears that my random texts are not
>> Zipfian enough; they're rather stair-steppy, and this produces
>> MI graphs with Gaussian fall-offs to the sides -- I think I'm
>> learning that it's not as easy to create a Zipfian distribution
>> as it is made out to be.
>
> Sampling from a Zipfian distribution isn't difficult at all,

Yes; however, I was referring to something slightly different.
There is a paper by Wentian Li, 1997, claiming that if one
starts with an alphabet of N letters and generates random
words by picking letters uniformly at random from this
alphabet, one gets a Zipfian word-frequency distribution.
He supplies a mathematical proof that bounds his
distribution above and below by Zipfian distributions,
as well as some graphs of generated data.
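
For concreteness, here's a rough sketch of my reading of Li's setup --
letters drawn uniformly at random, with a space symbol terminating each
word; the alphabet size and sample length below are arbitrary:

    import random
    from collections import Counter

    def li_random_text(n_chars=1_000_000, alphabet_size=5, seed=1):
        """Draw characters uniformly from `alphabet_size` letters plus a
        space; the runs of letters between spaces are the 'words'."""
        rng = random.Random(seed)
        letters = [chr(ord('a') + i) for i in range(alphabet_size)]
        symbols = letters + [' ']                  # space ends a word
        text = ''.join(rng.choice(symbols) for _ in range(n_chars))
        return [w for w in text.split(' ') if w]   # drop empty strings

    words = li_random_text()
    ranked = Counter(words).most_common()

    # A few (rank, frequency) points; on a log-log plot they form stair
    # steps, since all words of the same length are equally probable.
    for rank in (1, 2, 5, 10, 50, 100, 500, 1000):
        if rank <= len(ranked):
            w, f = ranked[rank - 1]
            print(f"rank {rank:5d}  freq {f:7d}  word {w!r}")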

However, his *actual* distribution is stair-steppy: although
it's bounded above and below by Zipf distributions, it's not
actually a straight line on a log-log plot.  When I used his
algorithm as a source of word pairs, I discovered that the MI
distribution had a Gaussian fall-off instead of a log-linear
fall-off.
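
For reference, the MI here is pointwise mutual information,
log2( p(x,y) / (p(x) p(y)) ), estimated from counts; a sketch along
these lines (adjacent word pairs as the pair source is my assumption,
and the names are mine) is enough to reproduce the histogram:

    import math
    from collections import Counter

    def pmi_scores(tokens):
        """PMI, in bits, of each adjacent word pair observed in `tokens`."""
        pairs = list(zip(tokens, tokens[1:]))
        word_counts = Counter(tokens)
        pair_counts = Counter(pairs)
        n_words, n_pairs = len(tokens), len(pairs)
        scores = {}
        for (x, y), f in pair_counts.items():
            p_xy = f / n_pairs
            p_x = word_counts[x] / n_words
            p_y = word_counts[y] / n_words
            scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
        return scores

    # Feed in any token list -- e.g. the Li-style random words from the
    # sketch above -- and histogram the values to see the fall-off shape.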

What I conclude here is that Zipf's law is not as trivially
obtainable as Li implies, and that perhaps the "common
knowledge" that random texts are Zipfian may actually be
incorrect -- or at least, it raises the question, "what do you
mean when you say 'random text'?"

> Anyway, I get the following distribution of MI scores in this simulated
> population, which -- by definition -- has no statistical associations at all
> between words:

I was intrigued that when I generated 'random text', I
obtained curves with a strong offset towards positive
MI values.  This has a simple explanation: one would
need a truly immense sample size (trillions of words)
before one saw all possible word pairs equally often.
For small sample sizes (tens or hundreds of millions
of word pairs), some word pairs will be observed a
lot, while others will not be observed at all.
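
A toy simulation makes the effect concrete: draw both members of each
pair independently and uniformly (so the true MI of every pair is
exactly zero), then compute empirical MI only for the pairs that were
actually observed.  The vocabulary and sample sizes below are arbitrary,
but small enough that most possible pairs are never seen:

    import math
    import random
    from collections import Counter

    random.seed(0)
    vocab = 1000                  # vocabulary size (arbitrary)
    n_pairs = 100_000             # far fewer than the 1,000,000 possible pairs

    pairs = [(random.randrange(vocab), random.randrange(vocab))
             for _ in range(n_pairs)]

    pair_counts = Counter(pairs)
    left_counts = Counter(x for x, _ in pairs)
    right_counts = Counter(y for _, y in pairs)

    pmis = sorted(
        math.log2((f / n_pairs) /
                  ((left_counts[x] / n_pairs) * (right_counts[y] / n_pairs)))
        for (x, y), f in pair_counts.items())

    print("observed pair types:", len(pair_counts), "of", vocab * vocab)
    print("median empirical MI:", pmis[len(pmis) // 2])  # well above zero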

Thus, even though there's no statistical association at
all between words in this random text, one can still
see quite strong correlations, and an MI distribution
that's mostly positive.   Hmm.  This muddies the
intuitive interpretation of what positive MI really means.


> I think it's enlightening to split data into frequency layers, as I've done
> in the plot above.  It's obvious now that the hapax legomena (f=1, 338838
> out of 565926 pair types in my sample, i.e. about 60%)

I also played with graphs of number_of_occurrences x MI,
which have similar but slightly different shapes.
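
Splitting into layers is easy once the pair counts are in hand; a small
helper along these lines (names are mine, and it assumes pair counts and
PMI scores like those from the sketch further up):

    from collections import defaultdict

    def pmi_by_frequency_layer(pair_counts, pmi):
        """Group PMI scores into frequency layers: f=1, f=2, 2<f<=5, f>5.
        `pair_counts` maps pair -> count, `pmi` maps pair -> PMI score."""
        def layer(f):
            if f == 1:
                return "f=1"
            if f == 2:
                return "f=2"
            return "2<f<=5" if f <= 5 else "f>5"

        layers = defaultdict(list)
        for pair, f in pair_counts.items():
            layers[layer(f)].append(pmi[pair])
        return layers

    # Histogram each layer separately: the f=1 and f=2 layers dominate the
    # pooled plot, so the f>5 layer is where any real contrast shows up.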

> I can't see anything like the bulge you found in your data for
> slightly positive MI scores, so this might deserve further
> investigation. Possible explanations that come to mind are: (i)
> an artefact of the data you used (or data preparation); (ii)
> different proportions of hapax legomena (f=1) or dis legomena
> (f=2) that "show" through in the overall distribution.  In the plot
> above, it is obvious that the mode of the distribution for
> low-frequency data is shifted slightly towards positive MI
> scores, while for f > 5 there is a sharp peak at MI = 0.

There's an unstated, implicit question here: how does the
MI distribution for random text differ from that of real text?

The answer is, presumably, that for random text the f>5
data is peaked at MI=0, while for real text it's peaked
at positive MI.  But yes, discarding the low-frequency
pairs (the hapax and dis legomena) from both random and
real text seems to be the right thing to do.  I'll try to
generate some of those figures shortly.

 --linas

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora