[Corpora-List] Significance test for TTR

Benjamin Allison ballison at staffmail.ed.ac.uk
Mon Nov 21 10:20:50 UTC 2011


Chris,

I will assume you have good reason for asking about vocabulary  
richness measures, mindful of the fact (as David and others point out)  
that vocabulary richness might not be all that useful, and concentrate  
on the technical question.

TTR is a bad idea to use on at least two counts: one technical and the
other practical. The practical concern is that it's highly dependent
on text length, as others point out, so you'd have to use the same
size sample, which would mean throwing out some data (always a bad
place to start with statistics!). The technical concern is that there's
no reason to believe it follows any particular distribution, so you'd
end up just using an out-of-the-box test (like the t-test) which
wouldn't be appropriate.
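
To make the length dependence concrete, here is a minimal Python sketch.
The function name and the toy sentence are made up for illustration;
repeating a text adds tokens but no new types, so the TTR drops even
though the vocabulary is unchanged.

```python
def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(set(tokens)) / len(tokens)


# Toy data (an assumption for illustration, not from any real corpus).
short_text = "the cat sat on the mat".split()
long_text = short_text * 2  # same vocabulary, twice as many tokens

ttr_short = type_token_ratio(short_text)  # 5 types over 6 tokens
ttr_long = type_token_ratio(long_text)    # 5 types over 12 tokens

print(ttr_short, ttr_long)
```

Real text does add new types as it grows, so the drop is less extreme
than in this toy case, but the direction of the effect is the same.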

A better measure, stable across text lengths in my experience, is the  
proportion of word token pairs in the text which are the same word,  
i.e. if your corpus were:

w_1 w_2 w_3 w_1 w_2

there are 10 pairs:

w_1 w_2
w_1 w_3
w_1 w_1 *
w_1 w_2
w_2 w_3
w_2 w_1
w_2 w_2 *
w_3 w_1
w_3 w_2
w_1 w_2

of which two are the same word. This can be viewed as a binomial
parameter, where the ML estimator in this case would be 2/10, and so
if you have two corpora you wish to compare, you can use a test for
comparing binomial parameters.
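
The pair count above can be sketched in Python as follows. The function
names are my own; the first version counts matching pairs from word
frequencies (a word seen c times contributes c*(c-1)/2 matching pairs),
and the second enumerates every pair directly as a cross-check that is
only practical for small inputs.

```python
from collections import Counter
from itertools import combinations


def same_word_pairs(tokens):
    """Return (matching pairs, total pairs) using word frequencies."""
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens to form a pair")
    counts = Counter(tokens)
    same = sum(c * (c - 1) // 2 for c in counts.values())
    total = n * (n - 1) // 2
    return same, total


def same_word_pairs_brute_force(tokens):
    """Same result by listing every unordered pair of token positions."""
    pairs = list(combinations(tokens, 2))
    same = sum(1 for a, b in pairs if a == b)
    return same, len(pairs)


corpus = ["w_1", "w_2", "w_3", "w_1", "w_2"]
same, total = same_word_pairs(corpus)
estimate = same / total

print(same, total, estimate)  # 2 10 0.2
```

The frequency-based version is the one to use on a real corpus, since
the number of pairs grows quadratically with the number of tokens.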

There are lots of ways to go about this, but most people will suggest
using a normal approximation to the binomial and then testing for
different means. Beware: if the sample sizes are very different, the
assumption of equal variance will not hold. There are methods for
testing binomial populations directly, but they're a bit more involved.
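
As a rough sketch of the normal-approximation route, assuming each
corpus is summarised as (matching pairs, total pairs) and treating the
pairs as binomial trials as above: the function name and the counts in
the example call are hypothetical. The standard error is left unpooled,
so no equal-variance assumption is made.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions.

    x1, x2: counts of matching pairs; n1, n2: total numbers of pairs.
    Uses the unpooled standard error of the difference.
    """
    p1 = x1 / n1
    p2 = x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    if se == 0:
        raise ValueError("standard error is zero; test is undefined")
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, purely for illustration.
z, p_value = two_proportion_z_test(60, 1000, 40, 1000)
print(z, p_value)
```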

If you're tempted to go down the road of assuming normality, I'd
suggest applying the logistic (log-odds) transform to the parameters
first (since your estimated parameters will be close to 0, which is
where normal approximations break down), and if I recall correctly the
approximation gets better still if you take the difference of your
(transformed) parameters and test whether this difference is
significantly different from zero.
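
One possible reading of that suggestion, sketched in Python: take the
log-odds of each estimate, difference them, and compare the difference
to zero using the usual large-sample (delta-method) standard error for
a difference of log-odds. The function names and example counts are
assumptions for illustration, not a prescribed procedure.

```python
import math


def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1 - p))


def logit_difference_test(x1, n1, x2, n2):
    """Two-sided z-test on the difference of log-odds.

    x1, x2: counts of matching pairs; n1, n2: total numbers of pairs.
    """
    for x, n in ((x1, n1), (x2, n2)):
        if not 0 < x < n:
            raise ValueError("each count must be strictly between 0 and n")
    diff = logit(x1 / n1) - logit(x2 / n2)
    # Large-sample standard error of the difference of two log-odds.
    se = math.sqrt(1 / x1 + 1 / (n1 - x1) + 1 / x2 + 1 / (n2 - x2))
    z = diff / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return diff, z, p_value


# Hypothetical counts, purely for illustration.
diff, z_logit, p_logit = logit_difference_test(60, 1000, 40, 1000)
print(diff, z_logit, p_logit)
```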

One final word of caution--most tests will be assuming independent  
samples, and this will not hold in this case, and here's why. If you  
observe in one sample a value of, say, 0.00001, this tells you  
something about the likely distribution of the statistic in the other  
sample (for example, 0.99999 would be a pretty unlikely outcome...). A  
more realistic scenario would be where the values of the statistics in  
both samples are drawn from some common, underlying prior, but that's  
probably going to get quite involved and I'm not sure how accurate you  
want the test to be. In any case, the effect that this will have (just
so you're aware) is that you may judge your two samples to be from the
same population when in fact they're different; it's just that the
range of possible values is far narrower than you're allowing.

Hope that helps.

B

Quoting "David L. Hoover" <david.hoover at nyu.edu> on Sun, 20 Nov 2011  
22:43:54 -0500:

> Dear Chris,
>
> George has given a good explanation of some of the problems. A much  
> more severe problem is that lexical diversity/vocabulary richness is  
> simply not a very reliable statistic for differentiating  
> texts/authors. Although Tweedie and Baayen conclude that it can be  
> used with caution, my own research has shown that lexical diversity  
> shows extreme fluctuation within the works of a single author and  
> even between different sections of the same text. Perhaps there  
> might be a more systematic and reliable difference between text  
> types than between authors or texts, but lexical diversity is so  
> variable that even this doesn't seem very likely. For more detail,
> see my “Another Perspective on Vocabulary Richness.” Computers and
> the Humanities, 37(2), 2003: 151-78.
>
> Best,
> David Hoover
>
> On 11/20/2011 1:00 PM, Georgios Mikros wrote:
>>
>> Dear Chris,
>>
>> First things first. TTR is highly dependent on text length, so you
>> have to be sure that the measurements have been taken from equal
>> size text samples. Otherwise you should use a more robust index
>> such as Yule’s K or Zipf’s Z (see [1] for a detailed
>> description of this problem). Now coming to your original question,
>> TTR is a continuous variable and you could use the whole range of
>> parametric statistics. This means that you can use a t-test if you
>> want to check whether TTR is significantly different across two
>> classes (e.g. Gender distinction in essays), or ANOVA if your
>> independent variable has many classes (e.g. Text Genre, Text Topic
>> etc). You can also implement a linear regression model with
>> TTR as the dependent variable and, as independent variables, the
>> ones that describe your research hypothesis. In all the above cases
>> you need multiple TTR measurements because inferential statistics
>> are based on the distribution parameters of the TTR. There is also
>> the option to compare a single TTR value to a distribution of TTR
>> values using a one-sample location test (also called Z test), which
>> can tell you how far the specific TTR value lies from the mean of
>> the TTRs.
>>
>> If all you know is just 2 TTR values, I don’t think you
>> can compare them in any meaningful way.
>>
>> Best
>>
>> George Mikros
>>
>> [1] Tweedie, Fiona J., & Baayen, Harald R. (1998). How variable may  
>> a constant be? Measures of lexical richness in perspective.  
>> Computers and the Humanities, 32(5), 323-352.
>>
>> ____________________________
>>
>> George K. Mikros
>> Associate Professor of Computational and Quantitative Linguistics
>> Department of Italian Language and Literature
>> School of Philosophy
>> National and Kapodistrian University of Athens
>> Panepistimioupoli Zografou, GR-15784
>> Athens, Greece
>> Tel: +30 210 7277491, +30 6976111742
>> Email: gmikros at isll.uoa.gr
>> Web: http://users.uoa.gr/~gmikros/
>>
>> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On
>> Behalf Of CRuehlemann at aol.com
>> Sent: Sunday, November 20, 2011 7:21 PM
>> To: CORPORA at uib.no
>> Subject: [Corpora-List] Significance test for TTR
>>
>> Hi all,
>>
>> The type token ratio (TTR) is a measure of the lexical diversity of  
>> a text/text type. If one finds in two texts/text types two widely  
>> differing TTRs, one would like to assess the significance of this  
>> finding.
>>
>> Which test is appropriate for differences between TTRs?
>>
>> Best
>>
>> Chris
>>
>>
>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


