[Corpora-List] Corpus size and accuracy of frequency listings

Thu Apr 2 13:14:27 UTC 2009

This depends to a large extent on the nature of the data.  Estimates
for the head of the distribution are likely to be consistent across a
range of corpus sizes and samples (words like "the" are always
common).  Estimates for the tail are likely to vary in non-trivial ways.

We looked at this problem a long time ago and found that for some
words, the frequency estimate improves monotonically as you see more
data, taking the estimate from all of the data as the yardstick.  But
for other words (and I don't mean obscure ones) odd patterns happen.
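
To make the check concrete, here is a minimal sketch of one way to test
it, not the procedure from the paper cited below.  It computes a word's
relative frequency over growing prefixes of a token list and asks
whether the error against the full-data estimate ever gets worse.  The
function names, the step size and the 16-token toy corpus are all
made up for illustration.

```python
def relative_frequency_curve(tokens, word, step):
    """Relative frequency of `word` in growing prefixes of `tokens`.

    Returns (prefix_length, estimate) pairs, one every `step` tokens,
    plus a final pair for the whole list.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    curve = []
    count = 0
    for i, tok in enumerate(tokens, start=1):
        if tok == word:
            count += 1
        if i % step == 0 or i == len(tokens):
            curve.append((i, count / i))
    return curve


def converges_monotonically(curve):
    """True if the absolute error against the last (full-data) estimate
    never increases from one prefix to the next."""
    if not curve:
        raise ValueError("curve is empty")
    yardstick = curve[-1][1]
    errors = [abs(estimate - yardstick) for _, estimate in curve]
    return all(later <= earlier for earlier, later in zip(errors, errors[1:]))


# Hypothetical toy corpus: "the" is spread evenly, "bank" is bursty.
tokens = ("the bank the x "
          "the bank the bank "
          "the x the x "
          "the x the x").split()

the_curve = relative_frequency_curve(tokens, "the", step=4)
bank_curve = relative_frequency_curve(tokens, "bank", step=4)

print(converges_monotonically(the_curve))   # True
print(converges_monotonically(bank_curve))  # False
```

In the toy data "the" sits at 0.5 in every prefix, while "bank" goes
0.25, 0.375, 0.25, 0.1875, so its estimate moves away from the
yardstick before it comes back.  On a real corpus you would want a
much larger step and tokens in their original document order.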

Miles

James Curran and Miles Osborne.  A very very large corpus doesn't
always yield reliable estimates. Joint CoNLL02 - Workshop on Very
Large Corpora, Taipei, Taiwan. 2002
http://www.cogsci.ed.ac.uk/~osborne/convergence.ps.gz

