[Lexicog] agreed-upon minimum size for lexicographic corpora

Doug Cooper doug.cooper.thailand@gmail.com [lexicographylist] lexicographylist at yahoogroups.com
Wed Jul 6 02:11:28 EDT 2016

There's a recent paper that may shed some light on a data-driven approach
to finding an appropriate corpus size:

The dynamics of correlated novelties (2013)
F. Tria, V. Loreto, V.D.P. Servedio, S.H. Strogatz

They look at the rates of change in the curves that describe:
   - appearance of new lexical items (akin to Heaps' law)
   - relative frequencies of all items (akin to Zipf's law)

If I understand the fine print, the closer the (appropriately manipulated)
curves are to flattening out and meeting, the closer the corpus is to its
maximum informative size.  It shouldn't depend on the language typology,
or on any particular method of deciding what a "word" is, as long as there
is a random(ish) content distribution.
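
As a rough illustration of the first curve (not the procedure from the paper
itself), the following Python sketch tracks how many distinct items have
appeared after each token and fits a Heaps-style power law V(n) = K * n^beta
by ordinary least squares in log-log space. The function names are made up
for this example, tokenization is assumed to have been done already, and the
plain log-log fit is a crude estimator.

```python
import math
from typing import Iterable, List, Tuple


def growth_curve(tokens: Iterable[str]) -> List[int]:
    """Return V(n): the number of distinct items seen after each token."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


def fit_heaps(curve: List[int]) -> Tuple[float, float]:
    """Fit log V = log K + beta * log n by least squares; return (K, beta).

    Assumption: a simple log-log regression is good enough for a first look.
    """
    if len(curve) < 2:
        raise ValueError("need at least two points to fit a line")
    xs = [math.log(n) for n in range(1, len(curve) + 1)]
    ys = [math.log(v) for v in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta
```

A beta near 1 means nearly every new token is still a new item; the flatter
the fitted curve, the less each additional token contributes.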

I'd think this could help provide a stopping rule for unlimited data, or
a measure of how good an existing corpus is (by estimating the number of
new terms another x corpus items might provide).  There's a related thread
here regarding bigger, burlier methods of estimating power laws:

I'd love to hear if anybody cranks up some code to test this.  Indeed, I'd
be happy to contribute to that person's beer fund, since I'd like the same
bit of code to ask the same question about finding new phonological segments
in lexicons / open text / field elicitation -- just how good is a very small
corpus likely to be?
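
One possible starting point for such a test, offered as a sketch under simple
assumptions rather than as the method of Tria et al.: measure what fraction
of the most recent tokens were first occurrences, and use a fitted
Heaps-style curve V(n) = K * n^beta to project how many new items a further
batch might yield. The items can be words or phonological segments. All
names, the window size, and the threshold are hypothetical choices for
illustration; K and beta are assumed to come from a separate fit.

```python
from typing import Sequence


def recent_novelty_rate(tokens: Sequence[str], window: int) -> float:
    """Fraction of the last `window` tokens that were first occurrences."""
    if not 0 < window <= len(tokens):
        raise ValueError("window must be between 1 and len(tokens)")
    split = len(tokens) - window
    seen = set(tokens[:split])
    new_items = 0
    for tok in tokens[split:]:
        if tok not in seen:
            seen.add(tok)
            new_items += 1
    return new_items / window


def expected_new_types(k: float, beta: float, n: int, extra: int) -> float:
    """Project new items from `extra` more tokens, assuming V(n) = k * n**beta."""
    return k * ((n + extra) ** beta - n ** beta)


def should_stop(tokens: Sequence[str], window: int, threshold: float) -> bool:
    """Hypothetical stopping rule: stop once recent novelty falls below threshold."""
    return recent_novelty_rate(tokens, window) < threshold
```

For a very small corpus the novelty rate in any window will be noisy, so the
projection is only as trustworthy as the fit behind K and beta.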

Doug Cooper

On 7/5/2016 4:39 PM, 'Sang Yong Lee' sang-yong_lee at sall.com [lexicographylist] wrote:
> Hi!
> It makes a difference whether the corpus is for a major language or for a
> minority language. For minority and endangered languages, Leonard E.
> Newell’s /Handbook on Lexicography/ gives a hint about the minimum size of
> the corpus.
> He shared his experience in the Romblomanon (Philippines) project as follows:
> For example, a frequency count of words in the Romblomanon project revealed
> that fully 2,000 words occurred only once in the first million words of text.
> About forty percent of those words, however, were inflected verb forms.
> (Newell 1995: 43)
> Based on this experience, he recommends three million words of text as a
> modest project to aim for. His figure "Unique Morphemes Occurring in
> Various Corpus size" (Newell 1995: 21) shows that a three-million-word
> corpus yields 8,000 unique morphemes that occur three times or more.
> I hope this information is helpful to you.
> Cordially,
> Sang Yong
