[Corpora-List] Chi-Square

Jin-Dong Kim jdkim at is.s.u-tokyo.ac.jp
Sun Sep 17 15:16:27 UTC 2006


One reason for not using chi-square in text processing is its
requirement that each event be observed (strictly speaking, expected)
at least about five times to give reliable statistics, which is often
not the case in text processing.
Dunning's log-likelihood ratio is a statistic that is asymptotically
chi-square distributed and is known to perform reasonably well for
infrequently observed events.
It is also known to approach the chi-square statistic when each event
is observed frequently enough.
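
To make the point above concrete, here is a minimal sketch in plain
Python that computes both statistics for a contingency table of
co-occurrence counts. The function names and the example counts are my
own illustration and are not taken from Dunning's paper.

```python
import math

def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_chi_square(table):
    """Pearson's X^2: sum over cells of (O - E)^2 / E."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

def log_likelihood_g2(table):
    """Log-likelihood ratio G^2: 2 * sum over cells of O * ln(O / E).

    Cells with O = 0 contribute nothing (the limit of O * ln O is 0).
    """
    expected = expected_counts(table)
    return 2.0 * sum(o * math.log(o / e)
                     for obs_row, exp_row in zip(table, expected)
                     for o, e in zip(obs_row, exp_row)
                     if o > 0)

# Hypothetical counts: rows = word A present/absent,
# columns = word B present/absent.
frequent = [[110, 90], [90, 110]]   # every expected count is 100
rare = [[1, 0], [0, 9999]]          # a single co-occurrence of two hapaxes

for name, table in (("frequent", frequent), ("rare", rare)):
    print(name, pearson_chi_square(table), log_likelihood_g2(table))
```

With the frequent counts the two values nearly coincide, while for the
single rare co-occurrence Pearson's statistic comes out far larger than
the log-likelihood ratio.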

Regards,

Jin-Dong


On 9/17/06, Marco Baroni <baroni at sslmit.unibo.it> wrote:
> You can see a comparison of chi-square and the log-likelihood ratio in this
> famous paper, which I think was very influential in giving the chi-square
> test a bad name:
>
> T. Dunning, "Accurate Methods for the Statistics of Surprise and
> Coincidence," Computational Linguistics 19(1), 1993.
> http://citeseer.ist.psu.edu/dunning93accurate.html
>
> The paper is quite mathematical, but the basic idea and the empirical
> comparison part should be quite clear... (although the alternative to
> chi-square should be something like the log-likelihood ratio test, not MI,
> which has the same problem as the chi-square test: it overestimates the
> significance of the co-occurrence of rare words...)
>
>
> Regards,
>
> Marco