[Corpora-List] Significance test for TTR

David L. Hoover david.hoover at nyu.edu
Mon Nov 21 03:43:54 UTC 2011


Dear Chris,

George has given a good explanation of some of the problems. A much more 
severe problem is that lexical diversity/vocabulary richness is simply 
not a very reliable statistic for differentiating texts/authors. 
Although Tweedie and Baayen conclude that it can be used with caution, 
my own research has shown that lexical diversity shows extreme 
fluctuation within the works of a single author and even between 
different sections of the same text. Perhaps there might be a more 
systematic and reliable difference between text types than between 
authors or texts, but lexical diversity is so variable that even this 
doesn't seem very likely. For more detail, see my
“Another Perspective on Vocabulary Richness.” Computers and the 
Humanities, 37(2), 2003: 151-78.
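
The within-text fluctuation described above can be inspected with a minimal sketch like the following. It is an illustration only, not the procedure used in the cited article: the function names (`ttr`, `chunked`, `ttr_by_chunk`), the chunk size, and the assumption that the text is already tokenized into a list of word forms are all hypothetical choices.

```python
def ttr(tokens):
    """Type-token ratio: number of distinct word forms / number of tokens."""
    return len(set(tokens)) / len(tokens)


def chunked(tokens, size):
    """Consecutive, non-overlapping chunks of exactly `size` tokens.

    A trailing remainder shorter than `size` is dropped, so every chunk
    has the same length and the TTR values are comparable.
    """
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def ttr_by_chunk(tokens, size):
    """TTR of each equal-size chunk, in text order."""
    return [ttr(chunk) for chunk in chunked(tokens, size)]


if __name__ == "__main__":
    # Toy input; a real text would be tokenized far more carefully
    # and split into chunks of several thousand tokens.
    tokens = "the cat sat on the mat the end".split()
    print(ttr(tokens))
    print(ttr_by_chunk(tokens, 4))
```

Comparing the spread of the per-chunk values within one text to the difference between two texts gives a rough sense of whether that difference exceeds ordinary internal variation.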

Best,
David Hoover

On 11/20/2011 1:00 PM, Georgios Mikros wrote:
>
> Dear Chris,
>
> First things first. TTR is highly dependent on text length, so you have 
> to be sure that the measurements have been taken from equal-size text 
> samples. Otherwise you should use a more robust index such as Yule’s K 
> or Zipf’s Z (see [1] for a detailed description of this problem). 
> Now coming to your original question, TTR is a continuous variable and 
> you could use the whole range of parametric statistics. This means 
> that you can use a t-test if you want to check whether TTR is 
> significantly different across two classes (e.g. Gender distinction in 
> essays), or ANOVA if your independent variable has many classes (e.g. 
> Text Genre, Text Topic etc). You can also implement a linear 
> regression model with dependent variable TTR and independent variables 
> the ones that describe your research hypothesis. In all the above 
> cases you need multiple TTR measurements because inferential 
> statistics are based on the distribution parameters of the TTR. There 
> is also the option to compare a single TTR value to a distribution of 
> TTR values using a one-sample location test (also called a Z test), which 
> can tell you how far the specific TTR value lies from the 
> mean of the TTRs.
>
> If all you know is just two TTR values, I don’t think you can 
> compare them in any meaningful way.
>
> Best
>
> George Mikros
>
> [1] Tweedie, Fiona J., & Baayen, Harald R. (1998). How variable may a 
> constant be? Measures of lexical richness in perspective. Computers 
> and the Humanities, 32(5), 323-352.
>
> ____________________________
>
> George K. Mikros
>
> Associate Professor of Computational and Quantitative Linguistics
>
> Department of Italian Language and Literature
>
> School of Philosophy
>
> National and Kapodistrian University of Athens
>
> Panepistimioupoli Zografou, GR-15784
>
> Athens, Greece
>
> Tel: +30 210 7277491, +30 6976111742
>
> Email: gmikros at isll.uoa.gr
>
> Web: http://users.uoa.gr/~gmikros/
>
> *From:*corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] *On 
> Behalf Of *CRuehlemann at aol.com
> *Sent:* Sunday, November 20, 2011 7:21 PM
> *To:* CORPORA at uib.no
> *Subject:* [Corpora-List] Significance test for TTR
>
> Hi all,
>
> The type token ratio (TTR) is a measure of the lexical diversity of a 
> text/text type. If one finds in two texts/text types two widely 
> differing TTRs, one would like to assess the significance of this finding.
>
> Which test is appropriate for differences between TTRs?
>
> Best
>
> Chris
>
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
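
The suggestions in the quoted message (a length-robust index such as Yule's K, a t-test over multiple TTR measurements, and a z-score for a single TTR against a distribution) can be sketched in plain Python as below. This is a rough sketch under stated assumptions: the function names are hypothetical, the unequal-variance (Welch) form of the t-test is swapped in for the generic "t-test" mentioned in the message, and only the test statistic and degrees of freedom are computed. A p-value would need a t distribution, for instance from `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
import math
import statistics
from collections import Counter


def yules_k(tokens):
    """Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2.

    N is the number of tokens and V_i the number of types
    that occur exactly i times.
    """
    n = len(tokens)
    type_freqs = Counter(tokens)              # type -> frequency
    spectrum = Counter(type_freqs.values())   # frequency i -> V_i
    s2 = sum(i * i * v for i, v in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Each sample is a list of TTR values measured on equal-size text
    samples from one class (at least two values per class).
    """
    na, nb = len(sample_a), len(sample_b)
    va = statistics.variance(sample_a) / na
    vb = statistics.variance(sample_b) / nb
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


def z_score(value, reference):
    """Number of sample standard deviations `value` lies from the mean."""
    return (value - statistics.mean(reference)) / statistics.stdev(reference)


if __name__ == "__main__":
    # Made-up TTR values for illustration only, not real measurements.
    class_a = [0.5, 0.6, 0.7]
    class_b = [0.3, 0.4, 0.5]
    print(welch_t(class_a, class_b))
    print(z_score(0.7, class_b))
```

With only one TTR value per text, `statistics.variance` raises an error, which mirrors the point made in the message: two bare TTR values give no distribution to test against.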