<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=windows-1255">
<META content="MSHTML 6.00.2900.3020" name=GENERATOR>
<STYLE></STYLE>
</HEAD>
<BODY bgColor=#ffffff>
<DIV><FONT face=Arial size=2>Dear all,</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>I was asked to post the responses I received
to my request about comparing corpora using chi-square.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>I want to thank Professor Butler and Gaetanelle
Guilquin, whose responses are posted below.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>
<DIV><FONT size=2>You have another problem, which is that chi-square should be
used only on RAW frequencies, not on normalised data. One way of getting around
your problems might be to take the raw data and calculate the values in the
cells of the following 2 x 2 table:</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV>
<TABLE border=1 cellPadding=4>
<TR><TH></TH><TH><FONT size=2>Corpus A</FONT></TH><TH><FONT size=2>Corpus B</FONT></TH></TR>
<TR><TD><FONT size=2>Number of running words involved in collocations</FONT></TD><TD><FONT size=2>N1</FONT></TD><TD><FONT size=2>N2</FONT></TD></TR>
<TR><TD><FONT size=2>Number of running words not involved in collocations</FONT></TD><TD><FONT size=2>N3</FONT></TD><TD><FONT size=2>N4</FONT></TD></TR>
</TABLE></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>Then: </FONT></DIV>
<DIV><FONT size=2>(N1 + N2) will be the total number of running words involved
in collocations in the two corpora</FONT></DIV>
<DIV><FONT size=2>(N3 + N4) will be the total number of running words not
involved in collocations in the two corpora</FONT></DIV>
<DIV><FONT size=2>(N1 + N3) will be the total number of running words in corpus
A</FONT></DIV>
<DIV><FONT size=2>(N2 + N4) will be the total number of running words in corpus
B</FONT></DIV>
<DIV><FONT size=2>(N1 + N2 + N3 + N4) will be the total number of running words
in both corpora taken together</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>You then calculate chi-square on the 2 x 2 table, remembering
that strictly speaking Yates' correction is needed for such tables, though it is
more important where frequencies are small, and so may make little difference in
your case.</FONT></DIV>
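<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>For anyone who prefers to compute this directly, here is a
minimal Python sketch of the calculation described above. It is an illustration,
not part of the original responses: the function name chi_square_2x2 and the
example counts are hypothetical, and it assumes every row and column total is
non-zero.</FONT></DIV>
<PRE>
```python
def chi_square_2x2(n1, n2, n3, n4, yates=True):
    """Pearson chi-square for a 2 x 2 table of RAW frequencies.

    n1, n2: running words involved in collocations (corpus A, corpus B)
    n3, n4: running words not involved in collocations (corpus A, corpus B)
    Assumes all row and column totals are non-zero.
    """
    observed = [[n1, n2], [n3, n4]]
    row_totals = [n1 + n2, n3 + n4]   # involved / not involved, both corpora
    col_totals = [n1 + n3, n2 + n4]   # corpus A / corpus B sizes
    grand_total = n1 + n2 + n3 + n4

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of corpus and collocation status.
            expected = row_totals[i] * col_totals[j] / grand_total
            diff = abs(observed[i][j] - expected)
            if yates:
                # Yates' continuity correction: shrink each deviation by 0.5.
                diff = max(0.0, diff - 0.5)
            chi2 += diff ** 2 / expected
    return chi2


# Hypothetical raw counts, for illustration only (not from any real corpus).
chi2_plain = chi_square_2x2(120, 80, 880, 920, yates=False)
chi2_yates = chi_square_2x2(120, 80, 880, 920, yates=True)
print(round(chi2_plain, 3), round(chi2_yates, 3))
```
</PRE>
<DIV><FONT size=2>With these made-up counts the script prints 8.889 and 8.45, so
the correction lowers the statistic only slightly. Both values exceed 3.84, the
critical value for one degree of freedom at the 0.05 level.</FONT></DIV>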
<DIV> </DIV>
<DIV> </DIV>
<DIV>You can calculate chi-square with one of the following online
calculators:</DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2><A
href="http://www.georgetown.edu/faculty/ballc/webtools/web_chi.html">http://www.georgetown.edu/faculty/ballc/webtools/web_chi.html</A></FONT></DIV>
<DIV> </DIV>
<DIV><A
href="http://www.psych.ku.edu/preacher/chisq/chisq.htm">http://www.psych.ku.edu/preacher/chisq/chisq.htm</A></DIV>
<DIV> </DIV>
<DIV> </DIV></FONT></DIV></BODY></HTML>