[Corpora-List] Distribution of mutual information?
Stefan Evert
stefan.evert at uos.de
Wed Mar 18 15:04:39 UTC 2009
[retry with smaller plot, as message > 40 kB to corpora list was
blocked]
Hi Linas!
> Will, I've taken a more circuitous route, but not yet arrived
> at your results. At the moment, I've been trying to generate
> random texts. It appears that my random texts are not
> Zipfian enough, they're rather stair-steppy; and this produces
> MI graphs with Gaussian fall-offs to the sides -- I think I'm
> learning that it's not as easy to create a Zipfian distribution
> as it is made out to be.
Sampling from a Zipfian distribution isn't difficult at all, at least
if you're running R (btw, if you're doing a lot of exploratory data
analysis, you should consider R -- it's an excellent environment for
this kind of research):
library(zipfR)
# Zipf-Mandelbrot law with a=2
population <- lnre("zm", alpha=.5, B=.05)
w1 <- rlnre(population, 10e6)
w2 <- rlnre(population, 10e6)
With a little patience, this gives you a sample of 10 million random
co-occurrence tokens, where both components (w1 and w2) are sampled
from the same Zipfian distribution. Calculating contingency tables
and MI scores is a little tricky, though, and takes some more
patience ...
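A minimal sketch of that calculation, for anyone who wants to try it: it counts each pair type, looks up the two marginal frequencies, and computes MI = log2(O / E) with E = f1 * f2 / N. The function name mi_scores and the column names O, E, f1, f2 are my own choices for illustration, not part of zipfR. To keep the example self-contained, the demo draws a small sample with base-R sample() and plain 1/rank weights as a stand-in for the rlnre() call above; this is an assumption, not the Zipf-Mandelbrot population itself.

```r
# Pointwise MI for every observed pair type in a sample of co-occurrence tokens.
# w1, w2: equal-length vectors (or factors) with the two components of each token.
# (mi_scores and its column names are illustrative, not a zipfR function.)
mi_scores <- function(w1, w2) {
  stopifnot(length(w1) == length(w2))
  w1 <- as.character(w1)
  w2 <- as.character(w2)
  N  <- length(w1)                      # sample size (number of pair tokens)
  f1 <- table(w1)                       # marginal frequencies, first component
  f2 <- table(w2)                       # marginal frequencies, second component
  key   <- paste(w1, w2, sep = "\t")    # one string key per pair token
  first <- !duplicated(key)             # one representative token per pair type
  res <- data.frame(w1 = w1[first], w2 = w2[first],
                    O = as.numeric(table(key)[key[first]]),
                    stringsAsFactors = FALSE)
  res$f1 <- as.numeric(f1[res$w1])
  res$f2 <- as.numeric(f2[res$w2])
  res$E  <- res$f1 * res$f2 / N         # expected frequency under independence
  res$MI <- log2(res$O / res$E)
  res
}

# Small demo with a base-R stand-in for rlnre(): weights proportional to 1/rank.
set.seed(42)
V <- 1000
zipf_prob <- 1 / (1:V)
w1 <- sample(V, 1e5, replace = TRUE, prob = zipf_prob)
w2 <- sample(V, 1e5, replace = TRUE, prob = zipf_prob)
scores <- mi_scores(w1, w2)
summary(scores$MI)
```

Keying on the pasted pair string avoids building the full V x V cross-table, which would not fit in memory for a realistic vocabulary. With factors returned by rlnre(), the same function should work unchanged, since it converts its inputs to character first.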
Anyway, I get the following distribution of MI scores in this
simulated population, which -- by definition -- has no statistical
associations at all between words:
[Attachment: MI_simulated.png (image/png, 23643 bytes)
<http://listserv.linguistlist.org/pipermail/corpora/attachments/20090318/b95c9d7b/attachment-0001.png>]
The general shape is similar to what you found for your real-world
data, especially the nearly log-linear density curves for positive and
negative scores with a sharp bend at MI = 0.
I think it's enlightening to split data into frequency layers, as I've
done in the plot above. It's obvious now that the hapax legomena
(f=1, 338838 out of 565926 pair types in my sample, i.e. about 60%)
account for the bulk of the distribution and in particular for all
larger MI scores, which determine the overall shape of the
distribution. So if you want to solve the "riddle" behind this
distribution, you should probably take a closer look at the
mathematics of MI for hapax legomena (where the MI score is determined
only by the expected frequency, i.e. the product of the marginals).
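To make the hapax case concrete: with O = 1, the score reduces to MI = log2(N / (f1 * f2)), so it depends only on the sample size and the two marginal frequencies. Since f1 and f2 can never be smaller than O, a pair type with observed frequency O cannot score above log2(N / O), which is why only hapaxes can reach the largest values. The helper name mi_hapax and the example frequencies below are made up for illustration.

```r
# MI of a hapax pair (O = 1): only the marginals f1, f2 and the sample size N matter.
# (mi_hapax is an illustrative helper, not a library function.)
mi_hapax <- function(f1, f2, N) log2(N / (f1 * f2))

N <- 10e6   # sample size used in the simulation above

# Three hypothetical hapax pairs:
#   f1 = f2 = 1          -> both words occur only in this pair, MI = log2(N)
#   f1 * f2 = N          -> expected frequency is exactly 1, MI = 0
#   f1 * f2 > N          -> expected frequency above 1, MI < 0
mi <- mi_hapax(f1 = c(1, 1000, 5000), f2 = c(1, 10000, 10000), N = N)
print(mi)
```

So the positive tail of the hapax layer is populated by pairs of rare words and the negative tail by pairs of frequent words, all with the same observed frequency.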
I can't see anything like the bulge you found in your data for
slightly positive MI scores, so this might deserve further
investigation. Possible explanations that come to mind are: (i) an
artefact of the data you used (or data preparation); (ii) different
proportions of hapax legomena (f=1) or dis legomena (f=2) that "show"
through in the overall distribution. In the plot above, it is obvious
that the mode of the distribution for low-frequency data is shifted
slightly towards positive MI scores, while for f > 5 there is a sharp
peak at MI = 0.
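Splitting a table of pair types into frequency layers takes only a call to cut(). The sketch below uses a tiny made-up table (the O and MI values are toy numbers, not results from the simulation), and the layer boundaries are an assumption chosen to match the f=1, f=2 and f > 5 groups mentioned here.

```r
# Toy pair-type table: observed frequency O and MI score (made-up values).
scores <- data.frame(O  = c(1, 1, 2, 3, 7, 12),
                     MI = c(4.2, -0.3, 1.1, 0.4, 0.05, -0.02))

# Assign each pair type to a frequency layer; intervals are (0,1], (1,2], (2,5], (5,Inf).
scores$layer <- cut(scores$O, breaks = c(0, 1, 2, 5, Inf),
                    labels = c("f=1", "f=2", "2<f<=5", "f>5"))

layer_sizes <- table(scores$layer)                       # pair types per layer
layer_share <- layer_sizes / nrow(scores)                # proportion of each layer
layer_median <- tapply(scores$MI, scores$layer, median)  # location of each layer
print(layer_share)
print(layer_median)
```

Comparing layer_share between a real corpus and the simulated sample would be one way to check explanation (ii), and a per-layer density plot (e.g. one density() curve per level of layer) gives the kind of breakdown shown in the attached plot.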
Best wishes, and thanks for the fun example!
Stefan
--
The wonders of Googleology (episode 1)
"from collectibles to cars"
84,700,000 -- Google
9,443,672 -- Google N-grams (Web 1T5)
1 -- ukWaC
[ stefan.evert at uos.de | http://purl.org/stefan.evert ]