[Corpora-List] Distribution of mutual information?

Wed Mar 18 15:04:39 UTC 2009

[retry with smaller plot, as message > 40 kB to corpora list was  
blocked]

Hi Linas!

> Will, I've taken a more circuitous route, but not yet arrived
> at your results. At the moment, I've been trying to generate
> random texts. It appears that my random texts are not
> Zipfian enough, they're rather stair-steppy; and this produces
> MI graphs with Gaussian fall-offs to the sides -- I think I'm
> learning that it's not as easy to create a Zipfian distribution
> as it is made out to be.

Sampling from a Zipfian distribution isn't difficult at all, at least  
if you're running R (btw, if you're doing a lot of exploratory data  
analysis, you should consider R -- it's an excellent environment for  
this kind of research):

library(zipfR)
# Zipf-Mandelbrot law with a=2
population <- lnre("zm", alpha=.5, B=.05)
w1 <- rlnre(population, 10e6)
w2 <- rlnre(population, 10e6)

With a little patience, this gives you a sample of 10 million random  
co-occurrence tokens, where both components (w1 and w2) are sampled  
from the same Zipfian distribution.  Calculating contingency tables  
and MI scores is a little tricky, though, and takes some more  
patience ...
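
One possible way to do this step is sketched below. It is not necessarily the code used for the plot: the reduced sample size, the seed, the names pair.key and pairs, and the choice of pointwise MI as log2(O / E) with E = f1 * f2 / N are all assumptions made for illustration.

```r
library(zipfR)

set.seed(42)                        # arbitrary seed, only for reproducibility
population <- lnre("zm", alpha=.5, B=.05)
N <- 1e5                            # assumption: much smaller than 10e6, so it runs quickly

# convert type IDs to strings so they can serve as lookup keys below
w1 <- as.character(rlnre(population, N))
w2 <- as.character(rlnre(population, N))

f1 <- table(w1)                     # marginal frequencies of first components
f2 <- table(w2)                     # marginal frequencies of second components

# observed frequency O of each pair type, without building the full
# (and mostly empty) w1 x w2 cross-table
pair.key <- paste(w1, w2, sep="\t")
O <- table(pair.key)

first <- !duplicated(pair.key)      # one row per pair type
pairs <- data.frame(key=pair.key[first], w1=w1[first], w2=w2[first],
                    stringsAsFactors=FALSE)
pairs$O <- as.numeric(O[pairs$key])

# expected frequency under independence; as.numeric() avoids integer
# overflow in the product of marginals for large samples
pairs$E <- as.numeric(f1[pairs$w1]) * as.numeric(f2[pairs$w2]) / N
pairs$MI <- log2(pairs$O / pairs$E) # assumption: pointwise MI, base-2 logarithm

summary(pairs$MI[pairs$O == 1])     # hapax legomena layer
summary(pairs$MI[pairs$O > 5])      # higher-frequency layer
```

Looking the pairs up by a pasted key keeps memory use proportional to the number of observed pair types rather than to the product of the two vocabulary sizes.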

Anyway, I get the following distribution of MI scores in this  
simulated population, which -- by definition -- has no statistical  
associations at all between words:

[Attached plot, scrubbed by the list archive: MI_simulated.png, image/png, 23643 bytes]
URL: <http://listserv.linguistlist.org/pipermail/corpora/attachments/20090318/b95c9d7b/attachment-0001.png>

The general shape is similar to what you found for your real-world  
data, especially the nearly log-linear density curves for positive and  
negative scores with a sharp bend at MI = 0.

I think it's enlightening to split data into frequency layers, as I've  
done in the plot above.  It's obvious now that the hapax legomena  
(f=1, 338838 out of 565926 pair types in my sample, i.e. about 60%)  
account for the bulk of the distribution and in particular for all  
larger MI scores, which determine the overall shape of the  
distribution.  So if you want to solve the "riddle" behind this  
distribution, you should probably take a closer look at the  
mathematics of MI for hapax legomena (where the MI score is determined  
only by the expected frequency, i.e. the product of the marginal  
frequencies divided by the sample size).
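
To make the hapax case concrete, here is a short worked form. It assumes pointwise MI with a base-2 logarithm, which the discussion does not state explicitly; the symbols O (observed pair frequency), E (expected pair frequency), f_1 and f_2 (marginal frequencies) and N (sample size) are notation introduced here.

```latex
\mathrm{MI} = \log_2 \frac{O}{E},
\qquad
E = \frac{f_1 f_2}{N}
\quad\Longrightarrow\quad
\mathrm{MI}\big|_{O=1} = \log_2 \frac{N}{f_1 f_2}
  = \log_2 N - \log_2 f_1 - \log_2 f_2 .
```

Under these assumptions a hapax pair gets a positive score exactly when f_1 f_2 < N, and the largest possible score, log_2 N (about 23.3 for N = 10^7), is reached when both components are themselves hapaxes.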

I can't see anything like the bulge you found in your data for  
slightly positive MI scores, so this might deserve further  
investigation. Possible explanations that come to mind are: (i) an  
artefact of the data you used (or of the data preparation); (ii) a  
different proportion of hapax legomena (f=1) or dis legomena (f=2) that  
"shows" through in the overall distribution.  In the plot above, it is obvious  
that the mode of the distribution for low-frequency data is shifted  
slightly towards positive MI scores, while for f > 5 there is a sharp  
peak at MI = 0.

Best wishes, and thanks for the fun example!
Stefan

--
The wonders of Googleology (episode 1)

"from collectibles to cars"
	84,700,000 -- Google
	9,443,672 -- Google N-grams (Web 1T5)
	1 -- ukWaC

[ stefan.evert at uos.de | http://purl.org/stefan.evert ]
