[Corpora-List] Questions about t-score
Stefan Evert
stefan.evert at collocations.de
Thu Apr 9 17:02:03 UTC 2009
> I am writing to know whether the t-score used in corpus
> analysis is the same t-score used in regular statistics.
I'm afraid so, which means that it's entirely inapplicable to corpus
data and the resulting p-values cannot be interpreted in any
meaningful way. I complain about this at length here:
http://www.collocations.de/AM/section4.html#s4.1
> That is, if I am, for example, looking for the collocation ‘wing’ and
> ‘angel,’ and I find that these two words occur together 75 times with
> a t-score value of 4.3, can I say that the df (degree of freedom) is
> 75-1=74, and then go to the t-score table and try to find whether my
> result is statistically significant, i.e. p<0.05?
No, because the assumptions made by the test are so far off the mark in
this case that the test statistic doesn't even remotely follow a t
distribution. Empirical results and simulation experiments show that
t-score underestimates significance drastically (i.e. p-values are
much higher than for the mathematically appropriate Fisher exact
test); this behaviour is often desirable in the context of collocation
extraction, which accounts for the popularity of t-score.
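For reference, the t-score reported for collocations is usually computed
as (O - E) / sqrt(O), where O is the observed co-occurrence frequency and
E = f1 * f2 / N is the frequency expected if the two words were
independent. The sketch below shows that arithmetic in Python. The
marginal frequencies f1 and f2 and the sample size N are invented for
illustration (the question gives only O = 75), and the function names are
hypothetical.

```python
import math

def expected_cooccurrence(f1, f2, n):
    """Expected co-occurrence frequency under independence: E = f1 * f2 / N."""
    return f1 * f2 / n

def t_score(observed, expected):
    """t-score association measure: (O - E) / sqrt(O)."""
    return (observed - expected) / math.sqrt(observed)

# Hypothetical counts, NOT taken from any real corpus:
O = 75            # co-occurrences of the two words (from the question)
f1 = 1000         # assumed frequency of the first word
f2 = 2000         # assumed frequency of the second word
N = 1_000_000     # assumed sample size

E = expected_cooccurrence(f1, f2, N)
print("E =", E)
print("t =", t_score(O, E))
```

Note that nothing in this calculation involves degrees of freedom; the
value is best read as an association score used for ranking word pairs,
not as a test statistic to be looked up in a t table.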
If you really want to calculate p-values, you should use Fisher's test
on 2x2 contingency tables. You'll find, though, that most word pairs
appear to be significant with p < .05 (and even quite often p < .001).
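A minimal Python sketch of such a calculation follows. The 2x2 table is
fully determined by the co-occurrence frequency O11, the two marginal
frequencies and the sample size, and the one-sided p-value is the
hypergeometric probability of seeing at least O11 co-occurrences under
independence. The counts are made up for illustration and the helper
names are hypothetical; in practice a library routine such as
scipy.stats.fisher_exact can be used instead.

```python
import math

def log_comb(n, k):
    """Natural logarithm of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def contingency_table(o11, f1, f2, n):
    """2x2 table: rows = first word present/absent,
    columns = second word present/absent."""
    return [[o11, f1 - o11],
            [f2 - o11, n - f1 - f2 + o11]]

def fisher_pvalue(o11, f1, f2, n):
    """One-sided Fisher exact p-value P(X >= o11), where X follows the
    hypergeometric distribution with margins f1, f2 and sample size n."""
    k_min = max(o11, 0, f1 + f2 - n)   # stay inside the support
    k_max = min(f1, f2)
    log_total = log_comb(n, f2)
    p = 0.0
    for k in range(k_min, k_max + 1):
        # Work with logarithms to avoid overflow in the binomial coefficients.
        log_p = log_comb(f1, k) + log_comb(n - f1, f2 - k) - log_total
        p += math.exp(log_p)
    return min(p, 1.0)

# Hypothetical counts, NOT taken from any real corpus:
o11, f1, f2, n = 75, 1000, 2000, 1_000_000
print(contingency_table(o11, f1, f2, n))
print(fisher_pvalue(o11, f1, f2, n))
```

With margins like these the expected co-occurrence frequency is only 2,
so 75 observed co-occurrences yield an extremely small p-value.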
I cannot resist a little bit of self-promotion: you might want to look
at my PhD thesis
Evert, Stefan (2004). The Statistics of Word Cooccurrences: Word Pairs
and Collocations. Dissertation, Institut für maschinelle
Sprachverarbeitung, University of Stuttgart. Published in 2005, URN
urn:nbn:de:bsz:93-opus-23714.
or this handbook chapter
Evert, Stefan (2008). Corpora and collocations. In A. Lüdeling and M.
Kytö (eds.), Corpus Linguistics. An International Handbook, chapter
58. Mouton de Gruyter, Berlin.
which have extensive discussions of statistical measures of
association. Both can be downloaded from my homepage (see below).
Best regards,
Stefan Evert
[ stefan.evert at uos.de | http://purl.org/stefan.evert ]
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora