[Corpora-List] Quantifying lexical diversity of (corpus-derived) word lists

Jim Fidelholtz fidelholtz at gmail.com
Tue Apr 9 23:07:43 UTC 2013


Hi, Marko,

I should start by saying that I'm a 'trained scientist' (somewhat akin to a
trained seal, I suppose), with degrees in Math and Linguistics from a (we
would say 'the') top university in both fields. Aside from a little Army
training (using, as I recall from long ago, Hoël), I've never actually
taken a course in Statistics (btw, for me the 'experts' in statistics
generally are not mathematicians, but rather experimental psychologists,
who have quite different foibles from 'us ['hard'] scientists', but do know
their statistics). In addition, my only published use of statistics was the
Chi-squared test (it was, though, appropriate for the use I gave it).

That said (whew!), you should not think your questions are embarrassing in
the slightest; they reflect what seem to me to be common concerns in the
(often uninformed) use of statistics in general. My use of statistics was
in the pre-computer days of the 70s (perhaps I used a slide rule, as there
were no handy-dandy calculators either, or else I did it manually), but now
there are easily available programs to churn out in seconds or minutes what
used to take, literally, days on computers (the fast ones then ran at
1 *Kilo*hertz or so; sigh). We'll almost certainly be approaching Terahertz
speeds soon (that's a factor of 10 to the ninth faster!). One important
point is to *read up on* each type of test you are considering using,
especially the *limitations* on the use of each test. (For example, for the
Chi-squared test, my rule of thumb is that none of the cells should have
over about 100 entries; otherwise, you are practically *guaranteed*
statistical significance, since the statistic sums (observed - expected)
squared over expected, and so grows in proportion to N when the proportions
stay fixed.) Each statistical test has its own limitations, but the more
sensitive a test is, in general, the more complicated it is to use and
understand, and often its limitations will not fit what you are trying to
do. (Also, it is more likely to take longer to run, but nowadays that is
much less significant.) Simply put, for *every* statistical test there are
certain types of data it should *not* be used to analyze; which types
depends on the test in question. I'm sure there are handy-dandy guides on
the Web in
tabular form to help in choosing which test to use, and which ones to avoid
for specific purposes. Likewise, some tests are more sensitive than others.
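
To make the Chi-squared point concrete, here is a minimal sketch in plain Python. The 2x2 counts are made up for illustration, and `chi_squared` is a hypothetical helper written for this note, not a library function. It computes the Pearson statistic and shows that multiplying every cell by the same factor multiplies the statistic by that factor, even though the proportions never change.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Made-up counts: the same 60/40 vs 40/60 split at three sample sizes.
base = [[12, 8], [8, 12]]
for factor in (1, 10, 100):
    scaled = [[cell * factor for cell in row] for row in base]
    print(factor, round(chi_squared(scaled), 2))
# prints 1.6, then 16.0, then 160.0 alongside the factors 1, 10, 100
```

With one degree of freedom the usual 5% critical value is about 3.84, so the smallest table is not significant while the two larger ones are, although all three describe exactly the same proportions.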

I hope these comments are of some use for you. I'm sure other list members
know a lot more than I do about specific tests, and perhaps can guide you
to references you can use to decide which tests to use for your purposes.
Good luck.

Jim


On Mon, Apr 8, 2013 at 4:33 PM, Marko, Georg (georg.marko at uni-graz.at) <
georg.marko at uni-graz.at> wrote:

> Dear corpus linguists,
>
> I’m almost a tabula rasa when it comes to statistics so please excuse me
> if the following question is complete nonsense.
>
> But there has been a problem that has been bothering me concerning the
> quantification of the lexical diversity (or lexical variation) in lists
> derived from corpora. Theoretically, these lists could be of any kind,
> formally or semantically defined. The idea is to compare different lists
> from one corpus or the same lists across different corpora with respect to
> how prominent the categories the lists represent are in a particular text,
> in a particular text type, discourse, genre, etc.
>
> Token frequencies are the obvious starting point for quantifying this,
> assuming that if words from one list occur more often than those from
> another the former category will be more prominent (leaving aside the
> question of what ‘prominence’ now means cognitively and/or socially).
>
> But lexical diversity* would be another as the status of a list of two
> lexemes occurring 50 times each (e.g. a list of pathonyms containing
> ‘disease’ and ‘illness’) is probably different from one of 25 lexemes
> occurring 4 times each on average (e.g. a list of pathonyms containing
> ‘cardiovascular disease’, ‘heart disease’, ‘coronary heart disease’, ‘heart
> failure’, ‘myocardial infarction’, ‘tachycardia’, ‘essential
> hypertension’…).
>
> The easiest way to quantify this would be to take the number of different
> types/lexemes in the list. This seems fine intuitively, even though I’m not
> sure to what extent I should be looking for a measure that is less
> dependent on token frequencies (obviously, there is usually a correlation
> between type and token frequencies). Type-token ratios could be another
> candidate, but it is the converse situation, with small lists showing
> higher values than larger lists.
>
> So I guess, my question is whether there is any (perhaps even established
> *embarrassment*) measure that would represent lexical diversity better.
>
> Maybe it all depends on what I mean by lexical diversity and by clarifying
> this I would avoid the problem at the other end of the analysis. However,
> if anyone knows, I would be grateful to learn.
>
> Thank you
>
> Best regards
>
>
>
> Georg Marko
>
>
>
> *There is a relation to the concept of “overlexicalization” or
> “overwording” used in Critical Discourse Analysis, which assumes that the
> use of many different lexemes for the same concept, or for similar or
> related concepts, points to a certain preoccupation with an idea or set of
> ideas.
> The problem here is of course ‘over’ and the question of an implicitly
> assumed standard of lexicalization.
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
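
As a rough illustration of the two candidate measures in the quoted message (number of types and type-token ratio), here is a small Python sketch applied to Georg's two example lists. The function name `diversity` and the placeholder lexemes `term_0` ... `term_24` are assumptions made for this sketch; only the counts (2 lexemes at 50 tokens each, 25 lexemes at 4 tokens each) come from the message.

```python
from collections import Counter


def diversity(tokens):
    """Return (number of types, number of tokens, type-token ratio)."""
    counts = Counter(tokens)
    n_types = len(counts)
    n_tokens = sum(counts.values())
    return n_types, n_tokens, n_types / n_tokens


# List A: two lexemes occurring 50 times each.
list_a = ["disease"] * 50 + ["illness"] * 50
# List B: 25 placeholder lexemes occurring 4 times each.
list_b = [f"term_{i}" for i in range(25) for _ in range(4)]

print(diversity(list_a))  # (2, 100, 0.02)
print(diversity(list_b))  # (25, 100, 0.25)
```

Both lists have the same token frequency (100), so the token count alone cannot tell them apart, while the type count and the type-token ratio both can. With lists of unequal token counts the ratio would no longer be directly comparable, which is the concern raised in the message.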



-- 
James L. Fidelholtz
Posgrado en Ciencias del Lenguaje
Instituto de Ciencias Sociales y Humanidades
Benemérita Universidad Autónoma de Puebla, MÉXICO