[Corpora-List] Significance test for TTR

Mon Nov 21 14:06:34 UTC 2011

Hi Chris,

LD measures can be quite robust indicators of lexical use in many areas of research. For the state of the art in LD research, read this paper:

McCarthy, P.M. & Jarvis, S. (2010). "MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment". Behavior Research Methods. 42:381-392.

Which can be found here for free

http://www.springerlink.com/content/257587jm46601751/

One LD index used by McCarthy and Jarvis (MTLD) appears not to depend on text length; however, several other indices might also suit your needs (D and M come to mind). Tools that compute these indices automatically are freely available, and one of them (the Gramulator) will even run statistical analyses for you (I think).

But do read the paper first ;)

Scott

On Nov 21, 2011, at 5:20 AM, Benjamin Allison wrote:

> Chris,
> 
> I will assume you have good reason for asking about vocabulary richness measures, mindful of the fact (as David and others point out) that vocabulary richness might not be all that useful, and concentrate on the technical question.
> 
> TTR is a bad idea to use on at least two counts: one practical and the other technical. The practical concern is that it's highly dependent on text length, as others point out, so you'd have to use samples of the same size, which would mean throwing out some data (always a bad place to start with statistics!). The technical concern is that there's no reason to believe it follows any particular distribution, so you'd end up just using something out of the box which wouldn't be appropriate (like the t-test).
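To make the length dependence concrete, here is a minimal sketch (not part of the original message). The function names `ttr` and `ttr_by_prefix` and the toy text are made up for illustration; real corpora will behave less regularly than this repeated sentence.

```python
def ttr(tokens):
    """Type-token ratio: number of distinct words divided by number of tokens."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(set(tokens)) / len(tokens)


def ttr_by_prefix(tokens, sizes):
    """TTR of the first n tokens, for each requested prefix size n."""
    return {n: ttr(tokens[:n]) for n in sizes if n <= len(tokens)}


# Toy text (an assumption for illustration): one 13-token sentence repeated 20 times.
toy = ("the cat sat on the mat and the dog sat on the rug " * 20).split()

# The vocabulary stops growing after the first sentence, so TTR keeps falling
# as the sample gets longer even though the "style" never changes.
for n, value in ttr_by_prefix(toy, [13, 26, 260]).items():
    print(n, round(value, 3))
```

The same text yields very different TTR values depending only on how much of it is sampled, which is why the samples have to be the same size before any comparison.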
> 
> A better measure, stable across text lengths in my experience, is the proportion of word token pairs in the text which are the same word. For example, if your corpus were:
> 
> w_1 w_2 w_3 w_1 w_2
> 
> there are 10 pairs:
> 
> w_1 w_2
> w_1 w_3
> w_1 w_1 *
> w_1 w_2
> w_2 w_3
> w_2 w_1
> w_2 w_2 *
> w_3 w_1
> w_3 w_2
> w_1 w_2
> 
> of which two are the same word. This can be viewed as a binomial parameter, where the ML estimator in this case would be 2/10, and so if you have two corpora you wish to compare, you can use a test for comparing binomial parameters.
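The pair count above does not require enumerating all pairs. A minimal sketch, assuming whitespace-tokenized input (the function name `same_pair_counts` is made up for illustration): a word that occurs c times contributes c(c-1)/2 same-word pairs, and N tokens give N(N-1)/2 pairs in total.

```python
from collections import Counter


def same_pair_counts(tokens):
    """Return (same_word_pairs, total_pairs) over all unordered token pairs."""
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens to form a pair")
    counts = Counter(tokens)
    # A word seen c times forms c*(c-1)/2 pairs with itself.
    same = sum(c * (c - 1) // 2 for c in counts.values())
    total = n * (n - 1) // 2
    return same, total


same, total = same_pair_counts("w_1 w_2 w_3 w_1 w_2".split())
print(same, total, same / total)  # prints: 2 10 0.2
```

This reproduces the 2/10 from the worked example. Up to scaling, the quantity is closely related to Yule's K, which is mentioned elsewhere in this thread.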
> 
> There are lots of ways to go about this, but most people will suggest using a normal approximation to the binomial and then testing for different means. Beware: if the sample sizes are very different, the assumption of equal variance will not hold. There are methods for testing binomial populations directly, but they're a bit more involved.
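A minimal sketch of the normal-approximation test described above, written with unpooled variances so that it does not rely on the equal-variance assumption. The function name and the counts in the usage lines are hypothetical, and the sketch treats the pair counts as binomial, which is the simplification the message itself proposes.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two binomial proportions.

    x1, x2: counts of same-word pairs; n1, n2: total pair counts.
    Uses unpooled standard errors (no equal-variance assumption).
    """
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, for illustration only.
z, p_value = two_proportion_z(30, 1000, 10, 1000)
print(round(z, 2), p_value < 0.01)
```

With real corpora the number of pairs is enormous and the pairs overlap heavily, so the nominal p-value should be read with caution.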
> 
> If you're tempted to go down the road of assuming normality, I'd suggest applying the logit (log-odds) transform to the parameters first, since your estimated parameters will be close to 0, which is where normal approximations break down. If I recall correctly, the approximation gets better still if you take the difference of your transformed parameters and test whether it differs significantly from zero.
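One way to read this suggestion is as a Wald test on the difference of log-odds. The sketch below is one interpretation, not necessarily the exact procedure the author had in mind. It uses the standard large-sample standard error of a log-odds estimate, sqrt(1/x + 1/(n-x)). The function names and the counts are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))


def logit_difference_test(x1, n1, x2, n2):
    """Wald test on logit(p1) - logit(p2), i.e. on the log odds ratio."""
    for x, n in ((x1, n1), (x2, n2)):
        if not 0 < x < n:
            raise ValueError("each count must be strictly between 0 and n")
    diff = logit(x1 / n1) - logit(x2 / n2)
    # Large-sample standard error of the difference of two log-odds.
    se = math.sqrt(1 / x1 + 1 / (n1 - x1) + 1 / x2 + 1 / (n2 - x2))
    z = diff / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return diff, z, p_value


# Hypothetical counts, for illustration only.
diff, z, p_value = logit_difference_test(30, 1000, 10, 1000)
print(round(diff, 3), round(z, 2), p_value < 0.01)
```

Working on the log-odds scale keeps the statistic unbounded, which is why the normal approximation tends to behave better for proportions close to 0.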
> 
> One final word of caution--most tests will be assuming independent samples, and this will not hold in this case. Here's why: if you observe in one sample a value of, say, 0.00001, this tells you something about the likely distribution of the statistic in the other sample (for example, 0.99999 would be a pretty unlikely outcome...). A more realistic scenario would be one where the values of the statistics in both samples are drawn from some common, underlying prior, but that's probably going to get quite involved and I'm not sure how accurate you want the test to be. In any case, the effect this will have (just so you're aware) is that you may judge your two samples to be from the same population when in fact they're different, because the range of possible values is far narrower than you're allowing.
> 
> Hope that helps.
> 
> B
> 
> Quoting "David L. Hoover" <david.hoover at nyu.edu> on Sun, 20 Nov 2011 22:43:54 -0500:
> 
>> Dear Chris,
>> 
>> George has given a good explanation of some of the problems. A much more severe problem is that lexical diversity/vocabulary richness is simply not a very reliable statistic for differentiating texts/authors. Although Tweedie and Baayen conclude that it can be used with caution, my own research has shown that lexical diversity shows extreme fluctuation within the works of a single author and even between different sections of the same text. Perhaps there might be a more systematic and reliable difference between text types than between authors or texts, but lexical diversity is so variable that even this doesn't seem very likely. For more detail, see my
>> “Another Perspective on Vocabulary Richness.” Computers and the Humanities, 37(2), 2003: 151-78.
>> 
>> Best,
>> David Hoover
>> 
>> On 11/20/2011 1:00 PM, Georgios Mikros wrote:
>>> 
>>> Dear Chris,
>>> 
>>> First things first. TTR is highly dependent on text length, so you have to be sure that the measurements have been taken from equal-size text samples. Otherwise you should use a more robust index such as Yule’s K or Zipf’s Z (see [1] for a detailed description of this problem). Now, coming to your original question: TTR is a continuous variable, so you could use the whole range of parametric statistics. This means that you can use a t-test if you want to check whether TTR is significantly different across two classes (e.g. gender distinction in essays), or ANOVA if your independent variable has many classes (e.g. text genre, text topic, etc.). You can also fit a linear regression model with TTR as the dependent variable and, as independent variables, the ones that describe your research hypothesis. In all the above cases you need multiple TTR measurements, because inferential statistics are based on the distribution parameters of the TTR. There is also the option of comparing a single TTR value to a distribution of TTR values using a one-sample location test (also called a Z test), which tells you how far the specific TTR value lies from the mean of the TTRs.
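A minimal sketch of the "multiple TTR measurements" idea: cut each text into equal-sized chunks, take one TTR per chunk, and compare the two sets of values. The chunk size, the function names, and the choice of Welch's unequal-variance t statistic are assumptions made for illustration. Whether a t-test is appropriate for TTR at all is disputed elsewhere in this thread.

```python
import math
from statistics import mean, variance


def chunk_ttrs(tokens, size):
    """TTR of each consecutive, non-overlapping chunk of exactly `size` tokens."""
    return [
        len(set(tokens[i:i + size])) / size
        for i in range(0, len(tokens) - size + 1, size)
    ]


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    if len(a) < 2 or len(b) < 2:
        raise ValueError("each sample needs at least two measurements")
    va, vb = variance(a) / len(a), variance(b) / len(b)
    if va + vb == 0:
        raise ValueError("both samples have zero variance")
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical per-chunk TTR values for two text types.
t, df = welch_t([0.5, 0.6, 0.7], [0.3, 0.4, 0.5])
print(round(t, 2), round(df, 1))
```

The statistic and degrees of freedom can then be looked up in a t table or passed to a statistics package to obtain a p-value. Any leftover tokens shorter than one full chunk are dropped.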
>>> 
>>> If all you have are two TTR values, I don’t think you can compare them in any meaningful way.
>>> 
>>> Best
>>> 
>>> George Mikros
>>> 
>>> [1] Tweedie, Fiona J., & Baayen, Harald R. (1998). How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32(5), 323-352.
>>> 
>>> ____________________________
>>> 
>>> George K. Mikros
>>> 
>>> Associate Professor of Computational and Quantitative Linguistics
>>> 
>>> Department of Italian Language and Literature
>>> 
>>> School of Philosophy
>>> 
>>> National and Kapodistrian University of Athens
>>> 
>>> Panepistimioupoli Zografou, GR-15784
>>> 
>>> Athens, Greece
>>> 
>>> Tel: +30 210 7277491, +30 6976111742
>>> 
>>> Email: gmikros at isll.uoa.gr
>>> 
>>> Web: http://users.uoa.gr/~gmikros/
>>> 
>>> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of CRuehlemann at aol.com
>>> Sent: Sunday, November 20, 2011 7:21 PM
>>> To: CORPORA at uib.no
>>> Subject: [Corpora-List] Significance test for TTR
>>> 
>>> Hi all,
>>> 
>>> The type token ratio (TTR) is a measure of the lexical diversity of a text/text type. If one finds in two texts/text types two widely differing TTRs, one would like to assess the significance of this finding.
>>> 
>>> Which test is appropriate for differences between TTRs?
>>> 
>>> Best
>>> 
>>> Chris
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>> Corpora mailing list
>>> Corpora at uib.no
>>> http://mailman.uib.no/listinfo/corpora
>> 
>> 
>> 
> 
> 
> 
> -- 
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
> 
> 
> 

Scott Crossley, Ph.D.
Department of Applied Linguistics/ESL
Georgia State University
http://www2.gsu.edu/~wwwesl/scottcrossleybio.html
