Chi-square
Joseph Davis
jdavis at ccny.cuny.edu
Fri Sep 3 21:30:16 UTC 2010
A colleague sent me the June 28 posting by
Yuri Tambovtsev, quoted below, to which I offer a belated reply that may be useful.
The Use of Chi-square by Yuri Tambovtsev
Adam Kilgariff wrote that it is not possible to
use Chi-square in corpus linguistics. I do not
think it is true. One can use Chi-square in
linguistics in all cases under the condition that
one keeps to the principle of commensurability.
That is here, if two samples are equal. I have
counted the occurrence of labial consonants in
the equal samples of 10000 speech sounds of
different Estonian and Russian authors. For
instance, in the text of the Estonian writer
Aarne Biin «Moetleja» and Enn Vetemaa «Neitsist
Suendinud» labial consonants occur 896 and 962
times. Could we say that statistically it is the
same? So, we put forward the null hypothesis
under the 5% level of significance and one degree
of freedom. The theoretical threshold value for
Chi-square is 3.841. The actual Chi-square value
should be less than 3.841 to state that the
occurrence of labial consonant in these two
samples is the same. We calculated the Chi-square
between 896 and 962. It is 2.344. Thus, it is
less than 3.841. So, the two text samples enter
the same general sample or in other words it is
statistically the same. I wonder if my reasoning is correct.
[End of quotation from June 28 posting by Yuri Tambovtsev]
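The quoted figure of 2.344 can be reproduced if one assumes a goodness-of-fit calculation in which each of the two equal-sized samples is expected to contain half of the 1858 labials. The posting does not state the formula used, so that reading is an assumption, and the function name below is hypothetical. A minimal Python sketch:

```python
def chi_square_equal_split(a, b):
    """Chi-square statistic for two counts, assuming each is
    expected to equal half of their total (one degree of freedom)."""
    expected = (a + b) / 2
    return ((a - expected) ** 2 + (b - expected) ** 2) / expected


# Counts of labial consonants quoted in the posting.
stat = chi_square_equal_split(896, 962)
print(round(stat, 3))  # 2.344
print(stat < 3.841)    # True: below the 5% threshold for 1 degree of freedom
```

This only confirms the arithmetic; it says nothing about whether the test is appropriate for the data.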
The main requirement for the use of the
chi-square test of significance is that the
observations (data points, tokens) in the sample
of some population be statistically
independent. That is, there should be no
statistical relation between one observation and
another in the data set. Knowing that one
observation has occurred should not make the next
observation, or any other, any easier to
predict. In my experience, such
independence among observations typically is not
a property of connected discourse. Rather, the
occurrence of one observation typically raises
the probability of the same type of observation
occurring next or later in the discourse, no
doubt because connected discourse is typically coherent, not random.
For instance, if a text in English concerns
largely the topic of peace, then there will
likely be many instances in the text of the
labial [p], due to the frequency of the word
peace and related words (peaceful, pacify,
peacenik, etc.). By contrast, if another text
is about health, then it will have a
disproportionately high frequency of [h],
relative to [p]. Consequently, given any
occurrence of a labial in the first text, there
is a somewhat elevated probability that another
labial will occur next or soon after; likewise,
an occurrence of [h] in the second text makes
another [h] more likely. This is statistical dependence, not
independence. As a result, chi-square is not
appropriate as a test of significance; it will
likely give an inaccurate measure of the degree
to which the sample of labials is representative
of the larger population of discourse from which
the sample was drawn. (In this case, I suppose
we can only imagine a hypothetical population of
English discourse from which our text was in
some idealized sense drawn, which is another reason the
use of a statistical test of significance may be
inappropriate: a text is not in any real sense a sample from a population.)
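The effect of such dependence on the test can be sketched with a small simulation. Everything in it is an assumption made for illustration, not something taken from the texts discussed: a toy two-state chain in which a labial raises the chance that the next sound is also a labial, a long-run labial rate of 0.09, samples of 2000 sounds, and hypothetical function names. Both simulated texts in each pair come from the same process, so the null hypothesis is true by construction and a well-behaved test at the 5% level should reject it in roughly 5% of pairs.

```python
import random

CRITICAL_5PCT_1DF = 3.841  # threshold quoted in the posting


def chi_square_equal_split(a, b):
    """Chi-square for two counts, each expected to be half the total."""
    expected = (a + b) / 2
    return ((a - expected) ** 2 + (b - expected) ** 2) / expected


def count_labials(n_tokens, p, r, rng):
    """Count 'labial' tokens in a toy two-state chain (an assumed model).

    p is the long-run proportion of labials; r is the correlation
    between neighbouring tokens (r = 0 gives independent tokens).
    """
    state = rng.random() < p
    count = 0
    for _ in range(n_tokens):
        # A labial makes the next labial more likely when r > 0.
        prob = p + r * (1 - p) if state else p * (1 - r)
        state = rng.random() < prob
        count += state
    return count


def rejection_rate(r, trials=300, n_tokens=2000, p=0.09, seed=0):
    """Share of same-process text pairs that the test calls 'different'."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        a = count_labials(n_tokens, p, r, rng)
        b = count_labials(n_tokens, p, r, rng)
        if chi_square_equal_split(a, b) > CRITICAL_5PCT_1DF:
            rejections += 1
    return rejections / trials


if __name__ == "__main__":
    print("independent tokens:", rejection_rate(0.0))
    print("dependent tokens:  ", rejection_rate(0.5))
```

Under these assumptions the dependent case should reject far more often than the independent one, because clustering makes the counts vary more from text to text than the chi-square calculation allows for. How strongly real discourse departs from independence is an empirical question that this toy model does not answer.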
I have a chapter from several years ago that
addresses this problem in relation to somewhat
different analytical concerns. The reference
is: Joseph Davis, 2002, Rethinking the place of
statistics in Columbia School analysis, in
Wallis Reid, Ricardo Otheguy, and Nancy Stern
(eds.), Signal, meaning, and
message: Perspectives on sign-based linguistics
(pp. 65-90). Amsterdam/Philadelphia: John Benjamins.
Joseph Davis, Ph.D.
Associate Professor
School of Education, NAC 6207
The City College
New York, NY 10031