[Lexicog] agreed-upon minimum size for lexicographic corpora

maxwell maxwell@umiacs.umd.edu [lexicographylist] lexicographylist at yahoogroups.com
Tue Jul 5 00:49:06 EDT 2016

On 6/20/2016 6:36 AM, Tsvi Sadan tsvi.sadan at gmail.com [lexicographylist] wrote:
> I wonder if there is any agreed-upon minimum size for lexicographic
> corpora though, for example, Atkins and Rundell write that there is no
> such definitive minimum size in their "Oxford Guide to Practical
> Lexicography" (2008: 61).

It depends on the language, what size dictionary you want, how finely 
you want to split senses, and a host of other decisions about the 
dictionary.  In other words, the corpus size is a dependent variable, 
not an independent variable.

For example, suppose you want a 10,000-lemma (word) dictionary.  Then 
you keep expanding your corpus until it contains 10,000 distinct lemmas 
(after throwing out things like proper nouns, place names, and 
code-switched words, if you don't want such things in your dictionary).

Of course, having only a single instance of some lemmas is probably not 
sufficient for lexicographic purposes; you may want some minimum number 
of instances of each lemma, say 10.  In that case you'd keep expanding 
your corpus until you had 10,000 lemmas, each with a (stemmed) token 
frequency of >= 10.
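
As a rough illustration of that stopping rule, here is a minimal Python 
sketch.  The function name corpus_size_needed, its parameters, and the 
use of lowercasing as a stand-in lemmatizer are all assumptions made for 
the example; a real project would plug in a proper lemmatizer or stemmer 
and a real filter for proper nouns and the like.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def corpus_size_needed(
    tokens: Iterable[str],
    target_lemmas: int,
    min_count: int = 1,
    lemmatize: Callable[[str], str] = str.lower,   # placeholder, not a real lemmatizer
    keep: Callable[[str], bool] = lambda lemma: True,  # hypothetical filter hook
) -> Optional[int]:
    """Return how many tokens had to be read before `target_lemmas`
    distinct lemmas each occurred at least `min_count` times.
    Return None if the corpus runs out first."""
    counts = Counter()
    qualified = 0  # lemmas that have reached min_count so far
    for n, token in enumerate(tokens, start=1):
        lemma = lemmatize(token)
        if not keep(lemma):
            # Excluded items (e.g. proper nouns) still count toward
            # corpus size, but never toward the lemma target.
            continue
        counts[lemma] += 1
        if counts[lemma] == min_count:
            qualified += 1
            if qualified >= target_lemmas:
                return n
    return None  # the corpus itself is the limiting factor


# Toy usage: three lemmas, each seen at least twice.
toy = "the cat saw the dog and the cat saw a bird".split()
needed = corpus_size_needed(toy, target_lemmas=3, min_count=2)
```

With target_lemmas=10000 and min_count=10 this is the case described 
above; a return value of None means the available corpus was too small 
for the dictionary you had in mind.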

Naturally, you may only be able to find a limited-size corpus for some 
languages, and in that case the corpus size will be the limiting factor 
on what kind of dictionary you can create with it.

    Mike Maxwell
    University of Maryland
