<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=us-ascii"><meta name=Generator content="Microsoft Word 12 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Tahoma;
panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman","serif";}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
span.EmailStyle17
{mso-style-type:personal-reply;
font-family:"Calibri","sans-serif";
color:#1F497D;}
.MsoChpDefault
{mso-style-type:export-only;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:70.85pt 70.85pt 70.85pt 70.85pt;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=ET link=blue vlink=purple><div class=WordSection1><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Adam, although I like the clarity of your statement, when you are working on productivity issues, such as wanting to determine whether an affix is productive, or why new words join a certain inflectional class, it is the type and token ratio that you should be looking at. It is one of the few meaningful numbers...<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Just couldn’t resist,<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Heiki Kaalep <o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p> </o:p></span></p><div style='border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0cm 0cm 0cm'><p class=MsoNormal><b><span lang=EN-US style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span></b><span lang=EN-US style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'> corpora-bounces@uib.no [mailto:corpora-bounces@uib.no] <b>On Behalf Of </b>Adam Kilgarriff<br><b>Sent:</b> Wednesday, April 10, 2013 10:48 PM<br><b>To:</b> Jeff Elmore<br><b>Cc:</b> CORPORA@UIB.NO<br><b>Subject:</b> Re: [Corpora-List] Quantifying lexical diversity of (corpus-derived) word lists<o:p></o:p></span></p></div><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>Jeff, Georg and all,<o:p></o:p></p><div><p class=MsoNormal><o:p> </o:p></p></div><div><p class=MsoNormal>I'd say - beware. 
I know Baayen's book - I've tried to read it and failed dismally - but I've also worked out that it doesn't matter, because what Baayen is trying to do is 'rescue' the type-token ratio, which is doomed to vary with text length, replacing it with something similar but cleverer that is not text-length dependent.<o:p></o:p></p></div><div><p class=MsoNormal><o:p> </o:p></p></div><div><p class=MsoNormal>Type-token ratios drive me spare. They often get given in papers for no better reason than that other authors give them (and some corpus software provides them for free). They never lead anywhere. They are meaningless. The number of types in a corpus depends on factors like whether the writer used a spellchecker, whether it is copy-edited, how objects like Yahoo! and [blah-de-blah] and :-) and zzzzzzzzzz are handled, and many, many others of little or no linguistic interest, yet highly variable between datasets. For any decent-sized corpus there will be tens of thousands of types, half of which will occur only once (the BNC, for instance, has 800,000 types). To work your way through a list like that would be staggeringly boring and a complete waste of time. So - any stat based on numbers of types: useless.<o:p></o:p></p></div><div><p class=MsoNormal><o:p> </o:p></p></div><div><p class=MsoNormal>Adam<o:p></o:p></p></div><div><p class=MsoNormal><o:p> </o:p></p><div><p class=MsoNormal>On 10 April 2013 18:38, Jeff Elmore <<a href="mailto:jelmore@lexile.com" target="_blank">jelmore@lexile.com</a>> wrote:<o:p></o:p></p><div><div><p class=MsoNormal>I'm not totally clear on whether you would be using corpora in the traditional sense, or using these word lists as corpora. 
But either way you might want to check out this book, Word Frequency Distributions by Baayen:<o:p></o:p></p></div><p class=MsoNormal><a href="http://books.google.com/books/about/Word_Frequency_Distributions.html?id=xUSM69ZkjHoC" target="_blank">http://books.google.com/books/about/Word_Frequency_Distributions.html?id=xUSM69ZkjHoC</a><o:p></o:p></p><div><p class=MsoNormal><o:p> </o:p></p></div><div><p class=MsoNormal>Comparing word frequency measures across corpora of different sizes is rife with complexity. Baayen goes into great detail from the ground up describing the issues with modelling word frequency distributions (which are at the heart of lexical diversity measures).<o:p></o:p></p></div><div><p class=MsoNormal><o:p> </o:p></p></div><div><p class=MsoNormal>He also talks about issues specifically related to quantifying lexical diversity. Measures such as type/token ratios are incredibly dependent upon sample size, so comparisons across corpora of different sizes are difficult to interpret, if not simply meaningless.<o:p></o:p></p></div><div><p class=MsoNormal><o:p> </o:p></p></div><div><p class=MsoNormal>He proposes a few adjustments that do help, and there are other techniques that have been proposed, such as vocd (<a href="http://ltj.sagepub.com/content/19/1/85.short" target="_blank">http://ltj.sagepub.com/content/19/1/85.short</a>). However, it seems like every time someone proposes a new technique, someone else shows how it still does not satisfactorily address issues related to sample-size dependence. For vocd, here is such a paper: <a href="http://ltj.sagepub.com/content/24/4/459.abstract" target="_blank">http://ltj.sagepub.com/content/24/4/459.abstract</a><o:p></o:p></p></div><div><p class=MsoNormal><o:p> </o:p></p></div><div><p class=MsoNormal>Overall I think there is, as yet, no simple solution to the problem of sample size dependence. 
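That dependence is easy to demonstrate for yourself. Here is a minimal Python sketch (not from the thread; the 10,000-word vocabulary and Zipf-style weights are illustrative assumptions) showing that the plain type-token ratio falls as the sample grows, even though the generating distribution, i.e. the text's underlying "diversity", never changes:

```python
# Minimal sketch: the type-token ratio (TTR) depends on sample size.
# We draw tokens from a fixed Zipf-like distribution, so lexical
# diversity is constant by construction, yet TTR drops as n grows.
import random

random.seed(0)
vocab = [f"w{i}" for i in range(10_000)]             # hypothetical vocabulary
weights = [1.0 / rank for rank in range(1, 10_001)]  # Zipf-like frequencies

def ttr(tokens):
    """Type-token ratio: distinct word forms divided by total tokens."""
    return len(set(tokens)) / len(tokens)

text = random.choices(vocab, weights=weights, k=100_000)

for n in (1_000, 10_000, 100_000):
    print(n, round(ttr(text[:n]), 3))
# The printed ratio should shrink steadily as n grows, which is why raw
# TTRs from corpora of different sizes are not directly comparable.
```

Running it prints one TTR per sample size; the ratio drops sharply between the thousand-token and the hundred-thousand-token sample, with no change in the text's underlying vocabulary.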
However, here is a link to a new technique called MTLD that claims to solve it: <a href="http://link.springer.com/article/10.3758/BRM.42.2.381" target="_blank">http://link.springer.com/article/10.3758/BRM.42.2.381</a><o:p></o:p></p></div><div><p class=MsoNormal><o:p> </o:p></p></div><div><p class=MsoNormal>I haven't read the paper or tried MTLD, so I couldn't say how effective it is. They claim that it is not dependent upon sample size. Probably someone will soon write a paper explaining why it is dependent on sample size after all (stay tuned!).<o:p></o:p></p></div><div><p class=MsoNormal><o:p> </o:p></p></div><div><p class=MsoNormal><o:p> </o:p></p></div></div><div><p class=MsoNormal style='margin-bottom:12.0pt'><o:p> </o:p></p><div><div><p class=MsoNormal>On Mon, Apr 8, 2013 at 5:33 PM, Marko, Georg (<a href="mailto:georg.marko@uni-graz.at" target="_blank">georg.marko@uni-graz.at</a>) <<a href="mailto:georg.marko@uni-graz.at" target="_blank">georg.marko@uni-graz.at</a>> wrote:<o:p></o:p></p></div><div><div><blockquote style='border:none;border-left:solid #CCCCCC 1.0pt;padding:0cm 0cm 0cm 6.0pt;margin-left:4.8pt;margin-right:0cm'><p class=MsoNormal>Dear corpus linguists,<br><br>I’m almost a tabula rasa when it comes to statistics, so please excuse me if the following question is complete nonsense.<br><br>But there is a problem that has been bothering me concerning the quantification of the lexical diversity (or lexical variation) in lists derived from corpora. Theoretically, these lists could be of any kind, formally or semantically defined. 
The idea is to compare different lists from one corpus or the same lists across different corpora with respect to how prominent the categories the lists represent are in a particular text, in a particular text type, discourse, genre, etc.<br><br>Token frequencies are the obvious starting point for quantifying this, assuming that if words from one list occur more often than those from another, the former category will be more prominent (leaving aside the question of what ‘prominence’ now means cognitively and/or socially).<br><br>But lexical diversity* would be another, as the status of a list of two lexemes occurring 50 times each (e.g. a list of pathonyms containing ‘disease’ and ‘illness’) is probably different from one of 25 lexemes occurring 4 times each on average (e.g. a list of pathonyms containing ‘cardiovascular disease’, ‘heart disease’, ‘coronary heart disease’, ‘heart failure’, ‘myocardial infarction’, ‘tachycardia’, ‘essential hypertension’…).<br><br>The easiest way to quantify this would be to take the number of different types/lexemes in the list. This seems fine intuitively, even though I’m not sure to what extent I should be looking for a measure that is less dependent on token frequencies (obviously, there is usually a correlation between type and token frequencies). Type-token ratios could be another candidate, but there it is the converse situation, with small lists showing higher values than larger lists.<br><br>So I guess my question is whether there is any (perhaps even established *embarrassment*) measure that would represent lexical diversity better.<br><br>Maybe it all depends on what I mean by lexical diversity, and by clarifying this I would avoid the problem at the other end of the analysis. 
However, if anyone knows of such a measure, I would be grateful to learn about it.<br><br>Thank you<br><br>Best regards<br><br><br><br>Georg Marko<br><br><br><br>*There is a relation to the concept of “overlexicalization” or “overwording” used in Critical Discourse Analysis, which assumes that the use of many different lexemes for the same concept, or for similar or related concepts, points to a certain preoccupation with an idea or set of ideas. The problem here is of course ‘over’ and the question of an implicitly assumed standard of lexicalization.<br><br>_______________________________________________<br>UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a><br>Corpora mailing list<br><a href="mailto:Corpora@uib.no" target="_blank">Corpora@uib.no</a><br><a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><o:p></o:p></p></blockquote></div></div></div><p class=MsoNormal><o:p> </o:p></p></div><p class=MsoNormal style='margin-bottom:12.0pt'><o:p> </o:p></p></div><p class=MsoNormal><br><br clear=all><o:p></o:p></p><div><p class=MsoNormal><o:p> </o:p></p></div><p class=MsoNormal>-- <br>========================================<br><a href="http://www.kilgarriff.co.uk/" target="_blank">Adam Kilgarriff</a> <a href="mailto:adam@lexmasterclass.com" target="_blank">adam@lexmasterclass.com</a> <br>Director <a href="http://www.sketchengine.co.uk/" target="_blank">Lexical Computing Ltd</a> <br>Visiting Research Fellow <a href="http://leeds.ac.uk" target="_blank">University of Leeds</a> 
<o:p></o:p></p><div><p class=MsoNormal><i><span style='color:#006600'>Corpora for all</span></i> with <a href="http://www.sketchengine.co.uk" target="_blank">the Sketch Engine</a> <o:p></o:p></p></div><div><p class=MsoNormal> <i><a href="http://www.webdante.com" target="_blank">DANTE: <span style='color:#009900'>a lexical database for English</span></a><span style='color:#009900'> </span> </i><o:p></o:p></p><div><p class=MsoNormal>========================================<o:p></o:p></p></div></div></div></div></body></html>