[Corpora-List] Corpus size and accuracy of frequency listings
Miles Osborne
miles at inf.ed.ac.uk
Thu Apr 2 13:14:27 UTC 2009
This depends to a large extent upon the nature of the data. The head
of the distribution is likely to be consistent across a range of
sizes and samples (words like "the" are always common). The tail is
likely to vary in non-trivial ways.
We actually looked at this problem a long time ago and found that for
some words, as you see more data, you get a monotonically improving
estimate of the frequency, taking the estimate from all of the data
as the yardstick. But for other words -- and I don't mean obscure
ones -- odd patterns happen.
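
To make the comparison concrete, here is a minimal sketch of the kind
of check described above: estimate a word's relative frequency from
growing prefixes of a corpus and compare each estimate with the one
from all of the data. This is not the procedure from the paper cited
below; the function names, the prefix sizes, and the toy corpus are
all made up for illustration.

```python
def prefix_estimates(tokens, word, sizes):
    """Relative frequency of `word` in the first n tokens, for each n.

    Assumes every n satisfies 0 < n <= len(tokens).
    """
    return [tokens[:n].count(word) / n for n in sizes]


def errors_vs_full(tokens, word, sizes):
    """Absolute error of each prefix estimate against the full-data estimate."""
    full = tokens.count(word) / len(tokens)
    return [abs(est - full) for est in prefix_estimates(tokens, word, sizes)]


def is_monotone_improving(errors):
    """True if the error never grows as more data is seen."""
    return all(later <= earlier for earlier, later in zip(errors, errors[1:]))


# Synthetic toy corpus (hypothetical data): "the" is spread evenly,
# while "budget" occurs only in one burst in the middle.
tokens = ["the", "a"] * 50 + ["the", "budget"] * 25 + ["the", "a"] * 125
sizes = [100, 150, 200, 400]

the_errors = errors_vs_full(tokens, "the", sizes)
budget_errors = errors_vs_full(tokens, "budget", sizes)

print("the:   ", the_errors, is_monotone_improving(the_errors))
print("budget:", budget_errors, is_monotone_improving(budget_errors))
```

In this toy setup the evenly spread word converges cleanly, while the
bursty word gets a worse estimate once the burst is seen and only
recovers later. Real corpora would need real tokenisation and many
more sample points.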
Miles
James Curran and Miles Osborne. A very very large corpus doesn't
always yield reliable estimates. Joint CoNLL02 - Workshop on Very
Large Corpora, Taipei, Taiwan. 2002
http://www.cogsci.ed.ac.uk/~osborne/convergence.ps.gz