[Corpora-List] chi square
Tina Waldman
wald at macam.ac.il
Tue Feb 13 20:19:09 UTC 2007
Dear all,
I was asked to post the responses I received following my request about comparing corpora using chi-square.
I want to thank Professor Butler and Gaetanelle Guilquin whose responses are posted below.
You have another problem, which is that chi-square should be used only on RAW frequencies, not on normalised data. One way of getting around your problems might be to take the raw data and calculate the values in the cells of the following 2 x 2 table:
                                                        Corpus A   Corpus B
Number of running words involved in collocations        N1         N2
Number of running words not involved in collocations    N3         N4
Then:
(N1 + N2) will be the total number of running words involved in collocations in the two corpora
(N3 + N4) will be the total number of running words not involved in collocations in the two corpora
(N1 + N3) will be the total number of running words in corpus A
(N2 + N4) will be the total number of running words in corpus B
(N1 + N2 + N3 + N4) will be the total number of running words in both corpora taken together
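As a rough illustration of how the cells and totals above can be derived from raw counts, here is a minimal Python sketch. The counts are invented purely for illustration, and the function name `contingency_cells` is hypothetical; only the N1..N4 layout comes from the table above.

```python
# Hypothetical raw counts, invented for illustration only.
total_a, colloc_a = 100_000, 12_000   # corpus A: running words, words in collocations
total_b, colloc_b = 150_000, 15_000   # corpus B: running words, words in collocations


def contingency_cells(colloc_a, total_a, colloc_b, total_b):
    """Return (N1, N2, N3, N4) for the 2 x 2 table, using RAW frequencies."""
    n1 = colloc_a              # corpus A, involved in collocations
    n2 = colloc_b              # corpus B, involved in collocations
    n3 = total_a - colloc_a    # corpus A, not involved in collocations
    n4 = total_b - colloc_b    # corpus B, not involved in collocations
    return n1, n2, n3, n4


n1, n2, n3, n4 = contingency_cells(colloc_a, total_a, colloc_b, total_b)

# Marginal totals, as listed above.
words_in_collocations = n1 + n2
words_not_in_collocations = n3 + n4
words_in_corpus_a = n1 + n3
words_in_corpus_b = n2 + n4
grand_total = n1 + n2 + n3 + n4

print(n1, n2, n3, n4)
print(words_in_corpus_a, words_in_corpus_b, grand_total)
```

Note that the column totals simply recover the corpus sizes, which is a quick sanity check that the cells were filled in from raw rather than normalised figures.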
You then calculate chi-square on the 2 x 2 table, remembering that strictly speaking Yates' correction is needed for such tables, though it is more important where frequencies are small, and so may make little difference in your case.
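For anyone who prefers to compute this directly, the following is a minimal Python sketch of the standard shortcut formula for a 2 x 2 table, with Yates' correction as an option. The function name `chi_square_2x2` and the example counts are hypothetical and chosen only for illustration; the cell order follows the N1..N4 layout of the table above.

```python
def chi_square_2x2(n1, n2, n3, n4, yates=True):
    """Chi-square statistic (1 degree of freedom) for the table

           Corpus A   Corpus B
           n1         n2         (involved in collocations)
           n3         n4         (not involved in collocations)

    Uses the shortcut formula N * (|n1*n4 - n2*n3| - c)^2 / (product of
    the four marginal totals), where c = N/2 with Yates' correction and
    c = 0 without it.
    """
    total = n1 + n2 + n3 + n4
    row_in, row_out = n1 + n2, n3 + n4
    col_a, col_b = n1 + n3, n2 + n4
    if 0 in (row_in, row_out, col_a, col_b):
        raise ValueError("a marginal total is zero; chi-square is undefined")
    diff = abs(n1 * n4 - n2 * n3)
    if yates:
        # The correction never pushes the difference below zero.
        diff = max(0.0, diff - total / 2)
    return total * diff ** 2 / (row_in * row_out * col_a * col_b)


# Small invented counts, where the correction makes a visible difference.
uncorrected = chi_square_2x2(10, 20, 30, 40, yates=False)
corrected = chi_square_2x2(10, 20, 30, 40, yates=True)
print(round(uncorrected, 4), round(corrected, 4))
```

The resulting value is compared against the chi-square distribution with 1 degree of freedom, for which the critical value at p = 0.05 is 3.841. With corpus-sized counts the corrected and uncorrected values will usually be very close, as noted above.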
You can calculate chi-square with one of the following chi-square calculators:
http://www.georgetown.edu/faculty/ballc/webtools/web_chi.html
http://www.psych.ku.edu/preacher/chisq/chisq.htm