[Corpora-List] Fwd: Re: Distribution of mutual information?
J Washtell
lec3jrw at leeds.ac.uk
Mon Mar 16 00:00:54 UTC 2009
For those interested in Linas' MI plot, John Goldsmith pointed out
(below) that my previous post and plot on this topic were somewhat
confusing and might benefit from clarification - particularly with
regard to the significance of log( p(x1)p(y1) / p(x2)p(y2) ). Upon
re-reading, I concur - this is very cryptic :-)
The plot is intended to illustrate that the shape of Linas'
distribution arises simply from the independent Zipf distributions of
the word tokens, together with the general structure of MI (i.e.
log(a/b)).
So, specifically, mutual information is defined as:
log( p(x,y) / p(x)p(y) )
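To give a quick worked example (with made-up numbers of my own): if
p(x) = p(y) = 0.01 and the pair co-occurs with p(x,y) = 0.001, then
MI = log2( 0.001 / (0.01 * 0.01) ) = log2(10), about 3.32 bits;
whereas if x and y were independent, p(x,y) would equal 0.0001 and
the MI would be exactly 0.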
Linas showed that this quantity, computed across a corpus, produces an
interesting distribution: interesting in the first instance because,
while not immediately recognizable as any "classic" distribution, it
is nonetheless geometrically very simple.
One could go to the trouble of plotting log( p(x,y) / p(x)p(y) ) for
a *randomly* generated corpus (where the token frequencies obey
Zipf's law, but there is no associative structure) to convince oneself
that the shape does not arise from any linguistic phenomenon. However,
besides being a bit of trouble to produce, this might nonetheless
obscure the root cause of the observed distribution by suggesting
that it has something to do with the distribution of observed
co-occurrences (the joint probability), even in a random corpus -
which it does not.
Rather, the pertinent fact is that the numerator and the denominator
each comprise variables with Zipf distributions. Plotting
log( p(x1)p(y1) / p(x2)p(y2) ) -- x1, y1, x2 and y2 all being
independent Zipf-distributed variables -- takes no co-occurrence into
account whatsoever, yet still produces the same shape, which is a good
way to illustrate this... and was very easy to mock up with a few tens
of thousands of randomly generated numbers in a spreadsheet (a
scripted version of the mock-up follows below).
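For anyone who would rather script it than use a spreadsheet, here is
a minimal sketch of the same mock-up in Python. The particulars are my
own assumptions (numpy/matplotlib, a vocabulary of 50,000 ranks, a
Zipf exponent of 1.1, sampling each variable by its Zipf probability);
the original plot was produced in a spreadsheet, not with this code:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
V, a, n = 50_000, 1.1, 40_000  # vocabulary size, Zipf exponent, samples

# Zipf probabilities over ranks 1..V: p(r) proportional to r^-a
ranks = np.arange(1, V + 1)
p = ranks ** -a
p /= p.sum()

# Draw x1, y1, x2, y2 independently (each by its own Zipf probability),
# taking no co-occurrence into account, and form the log-ratio
x1, y1, x2, y2 = (rng.choice(ranks, size=n, p=p) for _ in range(4))
log_ratio = np.log2(p[x1 - 1] * p[y1 - 1] / (p[x2 - 1] * p[y2 - 1]))

# A log-scaled frequency axis makes the symmetric linear flanks visible
plt.hist(log_ratio, bins=200, log=True)
plt.xlabel("log2( p(x1)p(y1) / p(x2)p(y2) )")
plt.ylabel("frequency")
plt.show()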
This formula is comparable to MI inasmuch as it is the ratio of the
distribution p(x)p(y) - where p(x) and p(y) are Zipf-distributed - to
one which is very similar, and inasmuch as it produces the same
characteristic symmetric log-linear shape.
The implication, therefore, is that the *linguistically* pertinent
features of Linas' distribution are manifest in its deviations from
this shape: A) the rightward skew, due to p(x,y) capturing actual
associative structure in the language, and B) - as Linas observes, and
I agree, much more interesting - the pronounced kinks.
Best regards,
Justin Washtell
University of Leeds
>
>>
>> Please compare the attached plot to yours. It is a probability
>> distribution over log(p(x1)p(y1)/p(x2)p(y2)), where x and y exhibit
>> approximate Zipf distributions. In other words, it is comparable
>> to calculating MI upon a random corpus which has no associative
>> structure.
> Could you post something in which you explain a bit more what you did
> (and perhaps even why)? Since MI specifically compares joint to
> marginal probabilities, I'm having trouble seeing why your expression
> is comparable to MI.
> thanks,
> John Goldsmith
----- End forwarded message -----