[Corpora-List] Questions about t-score

Stefan Evert stefan.evert at collocations.de
Thu Apr 9 17:02:03 UTC 2009


> I am writing to know whether the t-score used in corpus
> analysis is the same t-score used in regular statistics.

I'm afraid so, which means that it's entirely inapplicable to corpus  
data and the resulting p-values cannot be interpreted in any  
meaningful way.  I complain about this at length here:

	http://www.collocations.de/AM/section4.html#s4.1

> That is, if I am, for example, looking for the collocation ‘wing’ and
> ‘angel,’ and I find that these two words occur together 75 times with
> a t-score value of 4.3, can I say that the df (degree of freedom) is
> 75-1=74, and then go to the t-score table and try to find whether my
> result is statistically significant, i.e. p<0.05?

No, because the assumptions made by the test are so far off the mark in  
this case that the test statistic doesn't even remotely follow a t  
distribution.  Empirical results and simulation experiments show that  
t-score underestimates significance drastically (i.e. p-values are  
much higher than for the mathematically appropriate Fisher exact  
test); this behaviour is often desirable in the context of collocation  
extraction, which accounts for the popularity of t-score.
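
For reference, here is a minimal sketch of how t-score is usually
computed in collocation work: t = (O - E) / sqrt(O), where O is the
observed cooccurrence frequency and E the frequency expected under
independence.  The formula is the one commonly given in the collocation
literature rather than something spelled out in this message, and the
function name, argument names and counts below are hypothetical.

```python
import math


def t_score(o11, f1, f2, n):
    """t-score association measure for a word pair.

    o11 -- observed cooccurrence frequency of the pair (must be > 0)
    f1  -- marginal frequency of the first word
    f2  -- marginal frequency of the second word
    n   -- sample size (total number of pair tokens)
    """
    expected = f1 * f2 / n  # cooccurrences expected under independence
    return (o11 - expected) / math.sqrt(o11)


# Hypothetical counts, for illustration only (not real corpus data).
print(t_score(75, 1000, 2000, 1_000_000))
```

The resulting value is best read as a ranking score for candidate
pairs, not looked up in a t table with some number of degrees of
freedom.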

If you really want to calculate p-values, you should use Fisher's test  
on 2x2 contingency tables.  You'll find, though, that most word pairs  
appear to be significant with p < .05 (and even quite often p < .001).
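
As a rough illustration, the one-sided Fisher test for a 2x2 table can
be sketched with the Python standard library alone.  The sketch assumes
the table is described by the cooccurrence frequency o11, the two
marginal frequencies f1 and f2, and the sample size n; these names and
the counts are hypothetical, and the sum is carried out in log space so
that very small p-values do not underflow.

```python
import math


def log_choose(n, k):
    """Natural logarithm of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_pvalue(o11, f1, f2, n):
    """One-sided Fisher exact p-value P(X >= o11) for a 2x2 table.

    X follows the hypergeometric distribution determined by the
    marginals f1, f2 and the sample size n (null hypothesis of
    independence).  Assumes o11 is a valid cell count for these
    marginals.
    """
    k_max = min(f1, f2)  # largest cooccurrence count the marginals allow
    log_denominator = log_choose(n, f2)
    log_terms = [
        log_choose(f1, k) + log_choose(n - f1, f2 - k) - log_denominator
        for k in range(o11, k_max + 1)
    ]
    # Factor out the largest term before exponentiating (log-sum-exp).
    peak = max(log_terms)
    return math.exp(peak) * sum(math.exp(t - peak) for t in log_terms)


# Hypothetical counts, for illustration only (not real corpus data):
# 10 cooccurrences observed where about 2 are expected under independence.
print(fisher_pvalue(10, 500, 400, 100_000))
```

Even a modest excess over the expected frequency, as in this made-up
example, gives a very small p-value, which is why so many word pairs
come out as significant.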

I cannot resist a little bit of self-promotion: you might want to look  
at my PhD thesis

Evert, Stefan (2004). The Statistics of Word Cooccurrences: Word Pairs  
and Collocations. Dissertation, Institut für maschinelle  
Sprachverarbeitung, University of Stuttgart. Published in 2005, URN
urn:nbn:de:bsz:93-opus-23714.

or this handbook chapter

Evert, Stefan (2008). Corpora and collocations. In A. Lüdeling and M.  
Kytö (eds.), Corpus Linguistics. An International Handbook, chapter  
58. Mouton de Gruyter, Berlin.

which have extensive discussions of statistical measures of  
association.  Both can be downloaded from my homepage (see below).


Best regards,
Stefan Evert

[ stefan.evert at uos.de | http://purl.org/stefan.evert ]




_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


