[Corpora-List] Chi-Square

ted pedersen tpederse at d.umn.edu
Sun Sep 17 15:44:36 UTC 2006


On Mon, 18 Sep 2006, Jin-Dong Kim wrote:

> One of the reasons for not using chi-square in text processing is
> its requirement that each event be observed at least five times to
> get reliable statistics, which is not always the case in text
> processing.
> Dunning's log-likelihood is a kind of approximation of chi-square that
> is known to perform reasonably well for infrequently observed events.
> It is also known to approach chi-square when each event is observed
> frequently enough.
> 
> Regards,
> 
> Jin-Dong
> 

Greetings collocationalists, 

Just to elaborate a little, log-likelihood also has the "requirement"
that each event be observed 5 times, and there are other requirements
that both tests must meet as well (for example, the distribution of
counts should not be too skewed). Of course we typically violate these
with reckless abandon in NLP. :)
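
As a rough illustration of the "5 times" rule of thumb, here is a minimal
Python sketch. It assumes the rule is applied to the expected cell counts
under independence, which is how it is commonly stated in textbooks. The
function names and the bigram counts are made up for this example.

```python
def expected_counts(table):
    """Expected cell counts under independence for a 2-D contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def cells_below(table, threshold=5.0):
    """Return (row, column, expected) for cells whose expected count is below threshold."""
    return [
        (i, j, e)
        for i, row in enumerate(expected_counts(table))
        for j, e in enumerate(row)
        if e < threshold
    ]


# Hypothetical bigram table (made-up counts, 1,000,000 bigrams in total):
#                 w2      not w2
#   w1            10         990
#   not w1      1990      997010
bigram_table = [[10, 990], [1990, 997010]]

for i, j, e in cells_below(bigram_table):
    print(f"cell ({i}, {j}) has expected count {e:.1f}")
```

In this made-up table the bigram itself was observed 10 times, yet its
expected count under independence is only 2, so the rule of thumb is
already violated in the one cell we usually care about most.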

Chi-squared and log-likelihood are quite closely related (members of the
same family of tests), so when one works reasonably well the other probably
does too, and when one is unreliable the other might be too. Some of this
is summarized in an earlier note to this list, and in fact some of the
preceding and following messages are also quite relevant:

http://torvald.aksis.uib.no/corpora/1997-1/0160.html

BTW, there is a URL mentioned in that note that no longer exists.
It has been replaced by http://www.d.umn.edu/~tpederse/pubs.html, should
that seem relevant.
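
To make the point about the two statistics being closely related a bit more
concrete, here is a small Python sketch that computes Pearson's X^2 and
the log-likelihood ratio G^2 for a 2x2 table. The function names and both
tables are invented for illustration, and this is only one straightforward
way to write the textbook formulas.

```python
import math


def expected_counts(table):
    """Expected cell counts under independence for a 2-D contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_x2(table):
    """Pearson's X^2: sum over cells of (O - E)^2 / E."""
    exp = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for row_o, row_e in zip(table, exp)
        for o, e in zip(row_o, row_e)
    )


def log_likelihood_g2(table):
    """G^2: 2 * sum over cells of O * ln(O / E); empty cells contribute 0."""
    exp = expected_counts(table)
    return 2.0 * sum(
        o * math.log(o / e)
        for row_o, row_e in zip(table, exp)
        for o, e in zip(row_o, row_e)
        if o > 0
    )


# Made-up tables: one with comfortable counts, one sparse and skewed.
dense_table = [[30, 70], [50, 50]]
sparse_table = [[10, 990], [1990, 997010]]

for name, table in [("dense", dense_table), ("sparse", sparse_table)]:
    print(f"{name}: X^2 = {pearson_x2(table):.2f}, G^2 = {log_likelihood_g2(table):.2f}")
```

On the dense table the two values come out nearly the same (roughly 8.3
and 8.4). On the sparse, skewed table they disagree by about a factor of
two (roughly 32 versus 16), which is the kind of situation where neither
should be trusted blindly.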

I strongly encourage anyone interested in these issues to look carefully 
at Read and Cressie (1988), which is cited more fully in the note above.
Among other things, this lays out the history of the log-likelihood   
ratio and the Chi-squared test, and actually tells a rather dramatic  
story of how they have been in competition since the 1920s or so!

I think Read and Cressie are in some ways trying to mend the rift between
the two measures, and show that rather than being enemies they are in
fact members of the same family, and that you can tell a lot about one by
looking at the other. Anyway, it's a nice book, highly recommended both
for the technical content and the historical perspective it provides.
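
For anyone who wants to see the "same family" idea in code, below is a
minimal Python sketch of the power-divergence statistic as I understand it
from Cressie and Read: 2 / (lambda * (lambda + 1)) * sum O * ((O / E)^lambda - 1).
Setting lambda = 1 gives Pearson's X^2, and the limit lambda -> 0 gives G^2.
The function name and the counts are invented for this example, and the
sketch assumes all observed counts are positive.

```python
import math


def power_divergence(observed, expected, lam):
    """Power-divergence statistic for flat lists of observed and expected counts.

    lam = 1 reproduces Pearson's X^2; lam = 0 is handled as the limiting
    case, which is the log-likelihood ratio G^2. lam = -1 is not supported
    in this sketch.
    """
    if lam == -1:
        raise ValueError("lam = -1 needs its own limiting form, not covered here")
    if lam == 0:
        return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected))
    scale = 2.0 / (lam * (lam + 1.0))
    return scale * sum(o * ((o / e) ** lam - 1.0) for o, e in zip(observed, expected))


# Made-up 2x2 table [[30, 70], [50, 50]], flattened, with its expected
# counts under independence (row totals 100 and 100, column totals 80 and 120).
observed = [30, 70, 50, 50]
expected = [40.0, 60.0, 40.0, 60.0]

# Pearson's X^2 computed directly, for comparison with lam = 1.
pearson = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

for lam in (1.0, 2.0 / 3.0, 0.0):
    print(f"lambda = {lam:.3f}: {power_divergence(observed, expected, lam):.4f}")
print(f"Pearson X^2 computed directly: {pearson:.4f}")
```

The value lambda = 2/3 is the compromise that Cressie and Read suggest as
an alternative to both of the classical choices.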

Cordially,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse


