[Corpora-List] Quantifying lexical diversity of (corpus-derived) word lists

Jeff Elmore jelmore at lexile.com
Wed Apr 10 17:38:45 UTC 2013


I'm not totally clear on whether you would be using corpora in the
traditional sense, or using these word lists as corpora. But either way you
might want to check out this book: Word Frequency Distributions by Baayen:
http://books.google.com/books/about/Word_Frequency_Distributions.html?id=xUSM69ZkjHoC

Comparing word frequency measures across corpora of different sizes is rife
with complexity. Baayen goes into great detail, from the ground up, on the
issues involved in modelling word frequency distributions (which are at the
heart of lexical diversity measures).

He also talks about issues specifically related to quantifying lexical
diversity. Measures such as the type/token ratio are heavily dependent upon
sample size, so comparisons across corpora of different sizes are difficult
to interpret, if not simply meaningless.
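
Just to make the sample-size problem concrete, here is a minimal Python
sketch (the `text` it builds is a made-up Zipf-ish token stream, purely a
stand-in for whatever corpus or word list you actually have) showing how the
plain type/token ratio keeps falling as the sample grows:

    import random

    def ttr(tokens):
        # plain type/token ratio: distinct word forms divided by tokens
        return len(set(tokens)) / float(len(tokens))

    # Hypothetical "text": a Zipf-ish token stream (frequency roughly
    # proportional to 1/rank), standing in for any real corpus or list.
    random.seed(1)
    text = []
    for rank in range(1, 5001):
        text.extend(["word%d" % rank] * max(1, 5000 // rank))
    random.shuffle(text)

    # TTR on nested samples of increasing size: the value keeps falling,
    # so samples of different sizes are not directly comparable.
    for n in (1000, 10000, len(text)):
        print(n, round(ttr(text[:n]), 3))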

He proposes a few adjustments that do help, and other techniques have been
proposed as well, such as vocd (
http://ltj.sagepub.com/content/19/1/85.short). However, it seems that every
time someone proposes a new technique, someone else shows that it still does
not satisfactorily address sample-size dependence. For vocd, here is such a
paper: http://ltj.sagepub.com/content/24/4/459.abstract
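
For what it's worth, the core idea behind vocd, as I understand it from
secondhand descriptions (so treat this as an approximation rather than the
official implementation), is to average the TTR over many small random
subsamples at a range of sizes and then fit a one-parameter curve to the
resulting TTR-versus-size points; the fitted parameter D is reported as the
diversity score. A rough sketch along those lines, assuming SciPy is
available (ttr() repeated from the previous snippet):

    import random
    import numpy as np
    from scipy.optimize import curve_fit

    def ttr(tokens):
        return len(set(tokens)) / float(len(tokens))

    def ttr_model(n, d):
        # curve usually cited for vocd (Malvern & Richards):
        # TTR = (D/N) * (sqrt(1 + 2N/D) - 1)
        return (d / n) * (np.sqrt(1 + 2 * n / d) - 1)

    def vocd_like_d(tokens, sizes=range(35, 51), trials=100):
        # mean TTR over random subsamples at each size, then fit D
        sizes = np.array(list(sizes), dtype=float)
        mean_ttrs = np.array([
            sum(ttr(random.sample(tokens, int(n)))
                for _ in range(trials)) / trials
            for n in sizes
        ])
        popt, _ = curve_fit(ttr_model, sizes, mean_ttrs, p0=[50.0])
        return popt[0]

    # usage: vocd_like_d(tokens) for any tokenised text of 50+ tokens,
    # e.g. the made-up `text` from the previous snippet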

Overall I think there is, as yet, no simple solution to the problem of
sample-size dependence. However, here is a link to a new technique called
MTLD that claims to solve it:
http://link.springer.com/article/10.3758/BRM.42.2.381

I haven't read the paper or tried MTLD, so I couldn't say how effective it
is. The authors claim that it is not dependent upon sample size. Someone
will probably write a paper soon explaining why it is, in fact, dependent on
sample size (stay tuned!).
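
From the secondhand descriptions I have seen, the procedure is roughly: walk
through the text keeping a running TTR, count one "factor" each time that
running TTR drops to a threshold (0.72 is the figure usually quoted), reset,
credit any leftover stretch as a partial factor, and report tokens divided
by factors, averaged over a forward and a backward pass. Here is a sketch
under those assumptions, not a faithful reimplementation of the published
algorithm:

    def mtld_pass(tokens, threshold=0.72):
        # count "factors": stretches whose running TTR stays above threshold
        factors = 0.0
        types, count = set(), 0
        for token in tokens:
            types.add(token)
            count += 1
            if len(types) / float(count) <= threshold:
                factors += 1
                types, count = set(), 0
        if count > 0:
            # credit the leftover stretch as a partial factor
            factors += (1 - len(types) / float(count)) / (1 - threshold)
        return len(tokens) / factors if factors > 0 else float("inf")

    def mtld(tokens, threshold=0.72):
        # average of a forward and a backward pass
        forward = mtld_pass(tokens, threshold)
        backward = mtld_pass(list(reversed(tokens)), threshold)
        return (forward + backward) / 2.0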




On Mon, Apr 8, 2013 at 5:33 PM, Marko, Georg <georg.marko at uni-graz.at>
wrote:

> Dear corpus linguists,
>
> I’m almost a tabula rasa when it comes to statistics, so please excuse me
> if the following question is complete nonsense.
>
> But a problem has been bothering me concerning the quantification of
> lexical diversity (or lexical variation) in lists derived from corpora.
> Theoretically, these lists could be of any kind, formally or semantically
> defined. The idea is to compare different lists from one corpus, or the
> same lists across different corpora, with respect to how prominent the
> categories the lists represent are in a particular text, text type,
> discourse, genre, etc.
>
> Token frequencies are the obvious starting point for quantifying this,
> assuming that if words from one list occur more often than those from
> another, the former category will be more prominent (leaving aside the
> question of what ‘prominence’ means cognitively and/or socially).
>
> But lexical diversity* would be another, as the status of a list of two
> lexemes occurring 50 times each (e.g. a list of pathonyms containing
> ‘disease’ and ‘illness’) is probably different from that of a list of 25
> lexemes occurring 4 times each on average (e.g. a list of pathonyms
> containing ‘cardiovascular disease’, ‘heart disease’, ‘coronary heart
> disease’, ‘heart failure’, ‘myocardial infarction’, ‘tachycardia’,
> ‘essential hypertension’…).
>
> The easiest way to quantify this would be to take the number of different
> types/lexemes in the list. This seems fine intuitively, even though I’m
> not sure to what extent I should be looking for a measure that is less
> dependent on token frequencies (obviously, there is usually a correlation
> between type and token frequencies). Type-token ratios could be another
> candidate, but there the situation is the converse, with small lists
> showing higher values than larger ones.
>
> So I guess my question is whether there is any (perhaps even established
> *embarrassment*) measure that would represent lexical diversity better.
>
> Maybe it all depends on what I mean by lexical diversity, and by
> clarifying this I would avoid the problem at the other end of the
> analysis. However, if anyone knows of such a measure, I would be grateful
> to learn about it.
>
> Thank you
>
> Best regards
>
>
>
> Georg Marko
>
>
>
> *There is a relation to the concept of “overlexicalization” or
> “overwording” used in Critical Discourse Analysis, which assumes that the
> use of many different lexemes for the same concept, or for similar or
> related concepts, points to a certain preoccupation with an idea or set of
> ideas. The problem here is, of course, ‘over’ and the question of an
> implicitly assumed standard of lexicalization.
>