[Corpora-List] Significance test for TTR
Benjamin Allison
ballison at staffmail.ed.ac.uk
Mon Nov 21 10:20:50 UTC 2011
Chris,
I will assume you have good reason for asking about vocabulary
richness measures, mindful of the fact (as David and others point out)
that vocabulary richness might not be all that useful, and concentrate
on the technical question.
TTR is a bad idea to use on at least two counts, one practical and the
other technical. The practical concern is that it's highly dependent
on text length, as others point out, so you'd have to use the same
size sample, which would mean throwing out some data (always a bad
place to start with statistics!). The technical concern is that there's
no reason to believe it follows any particular distribution, so you'd
end up using an out-of-the-box test (like the t-test) that wouldn't be
appropriate.
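To make the length dependence concrete, here is a minimal sketch that
computes TTR on growing prefixes of one text. The toy sentence and the
function name type_token_ratio are made up for illustration; they are
not from any particular corpus or library.

```python
def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(set(tokens)) / len(tokens)


# Hypothetical toy text, used only to show the trend.
tokens = "the cat sat on the mat and the dog sat on the rug".split()

# TTR of the first 4 tokens, the first 8 tokens, and the whole text.
for n in (4, 8, len(tokens)):
    print(n, round(type_token_ratio(tokens[:n]), 3))
```

The ratio falls as the prefix grows even though the "author" is the
same throughout, which is why TTRs from samples of different sizes are
not directly comparable.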
A better measure, stable across text lengths in my experience, is the
proportion of word token pairs in the text which are the same word,
i.e. if your corpus were:
w_1 w_2 w_3 w_1 w_2
there are 10 pairs:
w_1 w_2
w_1 w_3
w_1 w_1 *
w_1 w_2
w_2 w_3
w_2 w_1
w_2 w_2 *
w_3 w_1
w_3 w_2
w_1 w_2
of which two are the same word. This proportion can be viewed as a
binomial parameter, where the ML estimator in this case would be 2/10,
so if you have two corpora you wish to compare, you can use a test for
comparing binomial parameters.
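You don't need to enumerate the pairs to get this estimate. A sketch,
assuming the text is already tokenised into a list of strings (the
function name same_pair_proportion is just illustrative): a word that
occurs c times contributes c*(c-1)/2 matching pairs, and a text of n
tokens has n*(n-1)/2 pairs in total.

```python
from collections import Counter


def same_pair_proportion(tokens):
    """Return (matching pairs, total pairs) over all unordered token pairs."""
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens to form a pair")
    counts = Counter(tokens)
    # A word seen c times yields c choose 2 pairs of identical tokens.
    same_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    total_pairs = n * (n - 1) // 2
    return same_pairs, total_pairs


# The five-token example from the message.
same, total = same_pair_proportion(["w_1", "w_2", "w_3", "w_1", "w_2"])
print(same, total, same / total)
```

On the five-token example this gives 2 matching pairs out of 10, i.e.
the 2/10 estimate above.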
There are lots of ways to go about this, but most people will suggest
using a normal approximation to the binomial and then testing for
different means. Beware: if the sample sizes are very different, the
assumption of equal variance will not hold. There are methods for
testing binomial populations directly, but they're a bit more involved.
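As a rough sketch of the normal-approximation route, here is a
two-sided z-test for two proportions. It uses the unpooled standard
error, which is one way to avoid leaning on the equal-variance
assumption; that choice, the function name and the counts in the
usage line are my assumptions for illustration. It also treats the
pairs as independent binomial trials, which is itself an
approximation.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Two-sided z-test for p1 == p2 using an unpooled standard error.

    x1, x2 are the matching-pair counts; n1, n2 are the total pair counts.
    """
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p1 - p2) / se
    # Two-sided tail area of the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, not taken from any real corpus.
z, p_value = two_proportion_z(200, 10_000, 300, 10_000)
print(z, p_value)
```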
If you're tempted to go down the road of assuming normality, I'd
suggest applying the logistic (logit) transform to the parameters
first (since your estimated parameters will be close to 0, which is
where normal approximations break down), and if I recall correctly the
approximation gets better still if you take the difference of your
(transformed) parameters and test whether that difference is
significantly different from zero.
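One way to read this suggestion is as a Wald test on the difference of
the two logits (the log odds ratio). The sketch below makes that
assumption, uses the usual delta-method variance 1/x + 1/(n - x) for
each logit, and again treats the pairs as independent trials. The
function names and the counts are illustrative only.

```python
import math


def logit(p):
    """Log odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1 - p))


def logit_difference_test(x1, n1, x2, n2):
    """Two-sided Wald test that logit(p1) - logit(p2) equals zero."""
    if min(x1, n1 - x1, x2, n2 - x2) <= 0:
        raise ValueError("each sample needs both matching and non-matching pairs")
    diff = logit(x1 / n1) - logit(x2 / n2)
    # Delta-method variance of each estimated logit is 1/x + 1/(n - x).
    se = math.sqrt(1 / x1 + 1 / (n1 - x1) + 1 / x2 + 1 / (n2 - x2))
    z = diff / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return diff, z, p_value


# Hypothetical counts, not taken from any real corpus.
diff, z, p_value = logit_difference_test(200, 10_000, 300, 10_000)
print(diff, z, p_value)
```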
One final word of caution--most tests will be assuming independent
samples, and this will not hold in this case, and here's why. If you
observe in one sample a value of, say, 0.00001, this tells you
something about the likely distribution of the statistic in the other
sample (for example, 0.99999 would be a pretty unlikely outcome...). A
more realistic scenario would be where the values of the statistics in
both samples are drawn from some common, underlying prior, but that's
probably going to get quite involved and I'm not sure how accurate you
want the test to be. In any case, the effect that this will have (just
so you're aware) is that you may judge your two samples to be from the
same population when in fact they're different, because the range of
possible values is far narrower than the test allows for.
Hope that helps.
B
Quoting "David L. Hoover" <david.hoover at nyu.edu> on Sun, 20 Nov 2011
22:43:54 -0500:
> Dear Chris,
>
> George has given a good explanation of some of the problems. A much
> more severe problem is that lexical diversity/vocabulary richness is
> simply not a very reliable statistic for differentiating
> texts/authors. Although Tweedie and Baayen conclude that it can be
> used with caution, my own research has shown that lexical diversity
> shows extreme fluctuation within the works of a single author and
> even between different sections of the same text. Perhaps there
> might be a more systematic and reliable difference between text
> types than between authors or texts, but lexical diversity is so
> variable that even this doesn't seem very likely. For more detail,
> see my “Another Perspective on Vocabulary Richness.” Computers and
> the Humanities, 37(2), 2003: 151-78.
>
> Best,
> David Hoover
>
> On 11/20/2011 1:00 PM, Georgios Mikros wrote:
>>
>> Dear Chris,
>>
>> First things first. TTR is highly dependent on text length, so you
>> have to be sure that the measurements have been taken from equal
>> size text samples. Otherwise you should use a more robust index
>> such as Yule’s K or Zipf’s Z (see [1] for a detailed
>> description of this problem). Now coming to your original question,
>> TTR is a continuous variable and you could use the whole range of
>> parametric statistics. This means that you can use a t-test if you
>> want to check whether TTR is significantly different across two
>> classes (e.g. Gender distinction in essays), or ANOVA if your
>> independent variable has many classes (e.g. Text Genre, Text Topic
>> etc). You can also implement a linear regression model with TTR as
>> the dependent variable and, as independent variables, the ones that
>> describe your research hypothesis. In all the above cases you need
>> multiple TTR measurements because inferential statistics are based
>> on the distribution parameters of the TTR. There is also the option
>> to compare a single TTR value to a distribution of TTR values using
>> a one-sample location test (also called Z test), which can tell you
>> how far the specific TTR value lies from the mean of the TTRs.
>>
>> If the only thing you know are just 2 TTR values I don’t think you
>> can compare them in any meaningful way.
>>
>> Best
>>
>> George Mikros
>>
>> [1] Tweedie, Fiona J., & Baayen, Harald R. (1998). How variable may
>> a constant be? Measures of lexical richness in perspective.
>> Computers and the Humanities, 32(5), 323-352.
>>
>> ____________________________
>>
>> George K. Mikros
>> Associate Professor of Computational and Quantitative Linguistics
>> Department of Italian Language and Literature
>> School of Philosophy
>> National and Kapodistrian University of Athens
>> Panepistimioupoli Zografou, GR-15784
>> Athens, Greece
>> Tel: +30 210 7277491, +30 6976111742
>> Email: gmikros at isll.uoa.gr
>> Web: http://users.uoa.gr/~gmikros/
>>
>> *From:*corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] *On
>> Behalf Of *CRuehlemann at aol.com
>> *Sent:* Sunday, November 20, 2011 7:21 PM
>> *To:* CORPORA at uib.no
>> *Subject:* [Corpora-List] Significance test for TTR
>>
>> Hi all,
>>
>> The type token ratio (TTR) is a measure of the lexical diversity of
>> a text/text type. If one finds in two texts/text types two widely
>> differing TTRs, one would like to assess the significance of this
>> finding.
>>
>> Which test is appropriate for differences between TTRs?
>>
>> Best
>>
>> Chris
>>
>>
>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.