[Corpora-List] Significance test for TTR
David L. Hoover
david.hoover at nyu.edu
Mon Nov 21 03:43:54 UTC 2011
Dear Chris,
George has given a good explanation of some of the problems. A much more
severe problem is that lexical diversity/vocabulary richness is simply
not a very reliable statistic for differentiating texts/authors.
Although Tweedie and Baayen conclude that it can be used with caution,
my own research has shown that lexical diversity shows extreme
fluctuation within the works of a single author and even between
different sections of the same text. Perhaps there might be a more
systematic and reliable difference between text types than between
authors or texts, but lexical diversity is so variable that even this
doesn't seem very likely. For more detail, see my
“Another Perspective on Vocabulary Richness.” Computers and the
Humanities, 37(2), 2003: 151-78.
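
To make the point about fluctuation within a single text concrete, here is a
minimal sketch (not taken from the article cited above) that computes TTR
for consecutive equal-sized sections of one text. The tokenizer, the
function names, and the choice of chunk size are all assumptions made for
illustration only.

```python
import re

def tokenize(text):
    """Crude, assumed tokenizer: lowercase, keep runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def ttr(tokens):
    """Type-token ratio: number of distinct word forms / number of tokens."""
    return len(set(tokens)) / len(tokens)

def chunk_ttrs(tokens, chunk_size):
    """TTR of each consecutive, non-overlapping chunk of chunk_size tokens.

    A shorter final remainder is dropped so that every chunk has the same
    length and the values are comparable with one another.
    """
    n_chunks = len(tokens) // chunk_size
    return [ttr(tokens[i * chunk_size:(i + 1) * chunk_size])
            for i in range(n_chunks)]
```

Calling chunk_ttrs(tokenize(text), 1000) on a long text gives one TTR per
1000-token section; the spread of those values is a rough indication of how
much the measure moves around inside a single text.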
Best,
David Hoover
On 11/20/2011 1:00 PM, Georgios Mikros wrote:
>
> Dear Chris,
>
> First things first. TTR is highly dependent on text length, so you have
> to be sure that the measurements have been taken from equal-sized text
> samples. Otherwise you should use a more robust index such as Yule’s K
> or Zipf’s Z (see [1] for a detailed description of this problem).
> Now coming to your original question, TTR is a continuous variable and
> you could use the whole range of parametric statistics. This means
> that you can use a t-test if you want to check whether TTR is
> significantly different across two classes (e.g. Gender distinction in
> essays), or ANOVA if your independent variable has many classes (e.g.
> Text Genre, Text Topic etc). You can also implement a linear
> regression model with dependent variable TTR and independent variables
> the ones that describe your research hypothesis. In all the above
> cases you need multiple TTR measurements because inferential
> statistics are based on the distribution parameters of the TTR. There
> is also the option to compare a single TTR value to a distribution of
> TTR values using a one-sample location test (also called a Z test), which
> can tell you how far the specific TTR value lies from the
> mean of the TTRs.
>
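
As an illustration of the Yule's K index mentioned above, here is a minimal
sketch using the usual textbook definition, K = 10^4 * (sum over m of
m^2 * V(m, N) - N) / N^2, where N is the number of tokens and V(m, N) is the
number of types that occur exactly m times. The function name is an
assumption, and the snippet only shows the computation; it does not settle
how stable K is across text lengths.

```python
from collections import Counter

def yules_k(tokens):
    """Yule's K for a list of tokens (textbook definition, scaled by 10^4)."""
    n = len(tokens)
    type_freqs = Counter(tokens)             # type -> how often it occurs
    spectrum = Counter(type_freqs.values())  # m -> V(m, N), types seen m times
    s2 = sum(m * m * v for m, v in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)

# "a" occurs twice, "b" and "c" once each: (4 + 1 + 1 - 4) / 16 * 10^4
print(yules_k(["a", "a", "b", "c"]))  # 1250.0
```

A text in which every token is a different type gives K = 0; more repetition
gives a larger K.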
> If all you know is just two TTR values, I don’t think you can
> compare them in any meaningful way.
>
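
For the cases above where multiple TTR measurements are available, the two
simplest tests could be sketched as follows. This is only an illustration
using Python's standard library: the function names are assumptions, the
two-sample version is Welch's variant of the t-test (which does not assume
equal variances), and the single-value comparison uses the sample standard
deviation with a normal approximation.

```python
import math
from statistics import NormalDist, mean, stdev, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two
    independent samples of TTR values (taken from equal-sized text samples).

    The p-value is not computed here; it comes from the t distribution with
    df degrees of freedom, e.g. scipy.stats.ttest_ind(a, b, equal_var=False).
    """
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)   # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def z_for_single_value(value, reference):
    """z score of one TTR value against a reference sample of TTRs, with a
    two-sided p-value from the standard normal distribution."""
    z = (value - mean(reference)) / stdev(reference)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p
```

Both functions need at least two values per sample, which is the same
requirement as stated above: with only two isolated TTR values there is no
variance to estimate.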
> Best
>
> George Mikros
>
> [1] Tweedie, Fiona J., & Baayen, Harald R. (1998). How variable may a
> constant be? Measures of lexical richness in perspective. Computers
> and the Humanities, 32(5), 323-352.
>
> ____________________________
>
> George K. Mikros
>
> Associate Professor of Computational and Quantitative Linguistics
>
> Department of Italian Language and Literature
>
> School of Philosophy
>
> National and Kapodistrian University of Athens
>
> Panepistimioupoli Zografou, GR-15784
>
> Athens, Greece
>
> Tel: +30 210 7277491, +30 6976111742
>
> Email: gmikros at isll.uoa.gr
>
> Web: http://users.uoa.gr/~gmikros/
>
> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On
> Behalf Of CRuehlemann at aol.com
> Sent: Sunday, November 20, 2011 7:21 PM
> To: CORPORA at uib.no
> Subject: [Corpora-List] Significance test for TTR
>
> Hi all,
>
> The type token ratio (TTR) is a measure of the lexical diversity of a
> text/text type. If one finds in two texts/text types two widely
> differing TTRs, one would like to assess the significance of this finding.
>
> Which test is appropriate for differences between TTRs?
>
> Best
>
> Chris
>
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora