[Corpora-List] chi square

Tina Waldman wald at macam.ac.il
Tue Feb 13 20:19:09 UTC 2007


Dear all,

I was asked to post the responses I received following my request about comparing corpora using chi-square.

I want to thank Professor Butler and Gaetanelle Guilquin whose responses are posted below.

You have another problem, which is that chi-square should be used only on RAW frequencies, not on normalised data. One way of getting around your problems might be to take the raw data and calculate the values in the cells of the following 2 x 2 table:

                                    Corpus A        Corpus B

Number of running words             N1              N2
involved in collocations

Number of running words             N3              N4
not involved in collocations

Then: 
(N1 + N2) will be the total number of running words involved in collocations in the two corpora
(N3 + N4) will be the total number of running words not involved in collocations in the two corpora
(N1 + N3) will be the total number of running words in corpus A
(N2 + N4) will be the total number of running words in corpus B
(N1 + N2 + N3 + N4) will be the total number of running words in both corpora taken together
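
As a rough illustration of these totals, here is a minimal Python sketch. The counts are invented for the example and do not come from any real corpus, and the function names marginals and expected_counts are only illustrative. It computes the five totals listed above and, from them, the frequencies each cell would be expected to have if collocation use were independent of corpus.

```python
# Hypothetical raw counts, invented purely for illustration.
N1, N2 = 1200, 900      # running words involved in collocations (corpus A, corpus B)
N3, N4 = 98800, 79100   # running words not involved in collocations (corpus A, corpus B)


def marginals(n1, n2, n3, n4):
    """Return the row, column and grand totals of the 2 x 2 table."""
    return {
        "in_collocations": n1 + n2,
        "not_in_collocations": n3 + n4,
        "corpus_a": n1 + n3,
        "corpus_b": n2 + n4,
        "grand_total": n1 + n2 + n3 + n4,
    }


def expected_counts(n1, n2, n3, n4):
    """Expected cell frequencies (E1, E2, E3, E4) under independence.

    Each expected value is (row total * column total) / grand total.
    """
    m = marginals(n1, n2, n3, n4)
    total = m["grand_total"]
    return (
        m["in_collocations"] * m["corpus_a"] / total,
        m["in_collocations"] * m["corpus_b"] / total,
        m["not_in_collocations"] * m["corpus_a"] / total,
        m["not_in_collocations"] * m["corpus_b"] / total,
    )


if __name__ == "__main__":
    print(marginals(N1, N2, N3, N4))
    print(expected_counts(N1, N2, N3, N4))
```

Note that all four cells hold raw counts of running words, so the sizes of the two corpora are already built into the table and no normalisation is needed.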

You then calculate chi-square on the 2 x 2 table, remembering that strictly speaking Yates' correction is needed for such tables, though it is more important where frequencies are small, and so may make little difference in your case.
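
For anyone who would rather compute the statistic directly, the following is a minimal Python sketch of chi-square for a 2 x 2 table, with Yates' correction as an option. It uses the standard shortcut formula for 2 x 2 tables. The function name chi_square_2x2 and the counts in the usage lines are assumptions made for illustration only, not real corpus data.

```python
def chi_square_2x2(n1, n2, n3, n4, yates=True):
    """Chi-square for the 2 x 2 table [[n1, n2], [n3, n4]] of RAW frequencies.

    Uses the shortcut formula
        N * (|n1*n4 - n2*n3| - c)^2 / (row1 * row2 * col1 * col2)
    where c = N/2 with Yates' correction and c = 0 without it.
    """
    total = n1 + n2 + n3 + n4
    diff = abs(n1 * n4 - n2 * n3)
    if yates:
        # The correction must never push the difference below zero.
        diff = max(0.0, diff - total / 2)
    denominator = (n1 + n2) * (n3 + n4) * (n1 + n3) * (n2 + n4)
    if denominator == 0:
        raise ValueError("a row or column total is zero; chi-square is undefined")
    return total * diff ** 2 / denominator


if __name__ == "__main__":
    # Hypothetical counts, invented for illustration.
    counts = (1200, 900, 98800, 79100)
    print("with Yates' correction:   ", chi_square_2x2(*counts, yates=True))
    print("without Yates' correction:", chi_square_2x2(*counts, yates=False))
```

A 2 x 2 table has one degree of freedom, so the result is compared against a critical value of 3.841 for significance at the 0.05 level. With counts in the thousands, the corrected and uncorrected values should differ only slightly, which is the point made above about large frequencies.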


You can calculate the chi-square with one of the following chi-square calculators:

http://www.georgetown.edu/faculty/ballc/webtools/web_chi.html

http://www.psych.ku.edu/preacher/chisq/chisq.htm
