<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><div>Hi Chris,</div><div><br></div><div>LD measures can be quite robust indicators of lexical use in many areas of research. For the state of the art in LD research, read this paper:</div><div><br></div><div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 10px/normal 'Times New Roman'; "><b>McCarthy, P.M. </b>&amp; Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. <i>Behavior Research Methods</i>, 42, 381-392. </div></div><div><br></div><div>It can be found here for free:</div><div><br></div><div><a href="http://www.springerlink.com/content/257587jm46601751/">http://www.springerlink.com/content/257587jm46601751/</a></div><div><br></div><div>One LD index used by McCarthy and Jarvis (MTLD) appears not to depend on text length; however, a variety of other indices might also suit your needs (d and m come to mind). Free tools are available to compute these indices automatically, and one of them (the Grammulator) will even run statistical analyses for you (I think).</div><div><br></div><div>But do read the paper first ;)</div><div><br></div><div>Scott</div><div><br></div><br><div><div>On Nov 21, 2011, at 5:20 AM, Benjamin Allison wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div>Chris,<br><br>I will assume you have good reason for asking about vocabulary richness measures, mindful of the fact (as David and others point out) that vocabulary richness might not be all that useful, and concentrate on the technical question.<br><br>TTR is a bad choice on at least two counts: one practical and the other technical. 
The practical concern is that it's highly dependent on text length, as others point out, so you'd have to use samples of the same size, which would mean throwing out some data (always a bad place to start with statistics!). The technical concern is that there's no reason to believe it follows any particular distribution, so you'd end up using some out-of-the-box test (like the t-test) that wouldn't be appropriate.<br><br>A better measure, stable across text lengths in my experience, is the proportion of word token pairs in the text which are the same word. For example, if your corpus were:<br><br>w_1 w_2 w_3 w_1 w_2<br><br>there are 10 pairs:<br><br>w_1 w_2<br>w_1 w_3<br>w_1 w_1 *<br>w_1 w_2<br>w_2 w_3<br>w_2 w_1<br>w_2 w_2 *<br>w_3 w_1<br>w_3 w_2<br>w_1 w_2<br><br>of which two (marked *) are the same word. This can be viewed as a binomial parameter, where the ML estimate in this case would be 2/10, so if you have two corpora you wish to compare, you can use a test for comparing binomial parameters.<br><br>There are lots of ways to go about this, but most people will suggest using a normal approximation to the binomial and then testing for a difference in means. Beware: if the sample sizes are very different, the assumption of equal variance will not hold. There are methods for testing binomial populations directly, but they're a bit more involved.<br><br>If you're tempted to go down the road of assuming normality, I'd suggest applying the logistic transform to the parameters first (since your estimated parameters will be close to 0, which is where normal approximations break down), and, if I recall correctly, the approximation gets better still if you take the difference of your (transformed) parameters and test whether it differs significantly from zero.<br><br>One final word of caution: most tests assume independent samples, and that will not hold in this case. Here's why. 
If you observe in one sample a value of, say, 0.00001, this tells you something about the likely distribution of the statistic in the other sample (for example, 0.99999 would be a pretty unlikely outcome...). A more realistic model would treat the values of the statistic in both samples as drawn from some common underlying prior, but that's probably going to get quite involved and I'm not sure how accurate you want the test to be. In any case, the effect that this will have (just so you're aware) is that you may judge your two samples to be from the same population when in fact they're different, because the range of possible values is far narrower than you're allowing.<br><br>Hope that helps.<br><br>B<br><br>Quoting "David L. Hoover" &lt;<a href="mailto:david.hoover@nyu.edu">david.hoover@nyu.edu</a>&gt; on Sun, 20 Nov 2011 22:43:54 -0500:<br><br><blockquote type="cite">Dear Chris,<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">George has given a good explanation of some of the problems. A much more severe problem is that lexical diversity/vocabulary richness is simply not a very reliable statistic for differentiating texts/authors. Although Tweedie and Baayen conclude that it can be used with caution, my own research has shown that lexical diversity shows extreme fluctuation within the works of a single author and even between different sections of the same text. Perhaps there might be a more systematic and reliable difference between text types than between authors or texts, but lexical diversity is so variable that even this doesn't seem very likely. 
For more detail, see my<br></blockquote><blockquote type="cite">“Another Perspective on Vocabulary Richness.” Computers and the Humanities, 37(2), 2003: 151-78.<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Best,<br></blockquote><blockquote type="cite">David Hoover<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">On 11/20/2011 1:00 PM, Georgios Mikros wrote:<br></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Dear Chris,<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">First things first. TTR is highly dependent on text length, so you have to be sure that the measurements have been taken from text samples of equal size. Otherwise you should use a more robust index such as Yule’s K or Zipf’s Z (see [1] for a detailed description of this problem). Now, coming to your original question, TTR is a continuous variable and you could use the whole range of parametric statistics. This means that you can use a t-test if you want to check whether TTR is significantly different across two classes (e.g. gender distinction in essays), or ANOVA if your independent variable has many classes (e.g. text genre, text topic, etc.). You can also fit a linear regression model with TTR as the dependent variable and, as independent variables, the ones that describe your research hypothesis. In all the above cases you need multiple TTR measurements because inferential statistics are based on the distribution parameters of the TTR. 
There is also the option to compare a single TTR value to a distribution of TTR values using a one-sample location test (such as a Z-test), which can tell you how far the specific TTR value lies from the mean of the TTRs.<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">If all you know is just two TTR values, I don’t think you can compare them in any meaningful way.<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Best<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">George Mikros<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">[1] Tweedie, Fiona J., & Baayen, R. Harald (1998). How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32(5), 323-352.<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">____________________________<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">George K. 
Mikros<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Associate Professor of Computational and Quantitative Linguistics<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Department of Italian Language and Literature<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">School of Philosophy<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">National and Kapodistrian University of Athens<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Panepistimioupoli Zografou, GR-15784<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Athens, Greece<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Tel: +30 210 7277491, +30 6976111742<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Email: <a href="mailto:gmikros@isll.uoa.gr">gmikros@isll.uoa.gr</a> <<a href="mailto:gmikros@isll.uoa.gr">mailto:gmikros@isll.uoa.gr</a>><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Web: <a href="http://users.uoa.gr/~gmikros/">http://users.uoa.gr/~gmikros/</a> <<a 
href="http://users.uoa.gr/%7Egmikros/">http://users.uoa.gr/%7Egmikros/</a>><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">*From:<a href="mailto:*corpora-bounces@uib.no">*corpora-bounces@uib.no</a> [mailto:corpora-bounces@uib.no] *On Behalf Of <a href="mailto:*CRuehlemann@aol.com">*CRuehlemann@aol.com</a><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">*Sent:* Sunday, November 20, 2011 7:21 PM<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">*To:* <a href="mailto:CORPORA@uib.no">CORPORA@uib.no</a><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">*Subject:* [Corpora-List] Significance test for TTR<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Hi all,<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">The type token ratio (TTR) is a measure of the lexical diversity of a text/text type. 
If one finds in two texts/text types two widely differing TTRs, one would like to assess the significance of this finding.<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Which test is appropriate for differences between TTRs?<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Best<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Chris<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">_______________________________________________<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora">http://mailman.uib.no/options/corpora</a><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Corpora mailing list<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><a href="http://mailman.uib.no/listinfo/corpora">http://mailman.uib.no/listinfo/corpora</a><br></blockquote></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><br></blockquote><br><br><br>-- <br>The University of Edinburgh is a charitable body, registered in<br>Scotland, with registration number SC005336.<br></div></blockquote></div><br><div>
<div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><span class="Apple-style-span" style="orphans: 2; text-indent: 0px; widows: 2; -webkit-text-decorations-in-effect: none; "><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><div style="border-collapse: separate; color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; ">Scott Crossley, Ph.D.<br>Department of Applied Linguistics/ESL<br>Georgia State University<br></div><div><a href="http://www2.gsu.edu/~wwwesl/scottcrossleybio.html">http://www2.gsu.edu/~wwwesl/scottcrossleybio.html</a></div></div></span></div></div>
</div>
<br></body></html>