[Corpora-List] Chi-Square

Sun Sep 17 21:33:44 UTC 2006

Crayton,

I've had a go at explaining just this to non-mathematicians in a recent
paper called "Language is never ever ever random", see
http://www.kilgarriff.co.uk/publications.htm 

Here's the core reason (taken from the abstract):

Language users never choose words randomly, and language is essentially
non-random. Statistical hypothesis testing [e.g. chi-square] uses a null
hypothesis, which posits randomness. Hence, when we look at linguistic
phenomena in corpora, the null hypothesis will never be true. Moreover,
where there is enough data, we shall (almost) always be able to establish
that it is not true. In corpus studies, we frequently do have enough data,
so the fact that a relation between two phenomena is demonstrably
non-random does not support the inference that it is not arbitrary.
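
A minimal sketch of the "enough data" point, assuming a 2x2 contingency
table for a word pair and the standard Pearson chi-square without
continuity correction. The counts are invented for illustration (the pair
co-occurs 11 times where independence would predict 10), and the function
name chi_square_2x2 is hypothetical, not taken from the paper:

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]].

    a: both words together      b: word 1 without word 2
    c: word 2 without word 1    d: neither word
    No continuity correction is applied.
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Critical value of chi-square with 1 degree of freedom at p = 0.05.
CRITICAL_05 = 3.841

# Invented counts: a very weak association in a 100,000-item sample.
base = (11, 989, 989, 98011)

# Keep the proportions fixed and only make the "corpus" bigger.
for scale in (1, 10, 100, 1000):
    table = tuple(scale * cell for cell in base)
    chi2 = chi_square_2x2(*table)
    verdict = "significant" if chi2 > CRITICAL_05 else "not significant"
    print(f"corpus x{scale:<5} chi-square = {chi2:8.3f}  ({verdict})")
```

With the proportions held fixed, the statistic grows in direct proportion
to the size of the sample, so the same weak association moves from "not
significant" to "significant" somewhere between the x10 and x100 corpus.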

Adam

Crayton Walker wrote:

> A simple question about statistical measures.
>
> Could someone explain in very simple terms why we don't normally use
> Chi-square as a measure of collocational significance? We tend to use 
> t-score and MI and not Chi-square. Why not? I am not a mathematician 
> so would appreciate it if you could keep it simple.
>
> Many thanks
>
> Crayton Walker
>
> University of Birmingham
>