[Corpora-List] extrapolating to 1 million

Lluís Padró padro at lsi.upc.edu
Fri May 15 13:39:26 UTC 2009

Tina Waldman wrote:
> Dear members
> Could you tell me what the frequency would be in a corpus of 1 million 
> if I extrapolated from the frequency of  20 in a corpus of 300K?
> Would it be 60 - 20 x 3 ?
   As a rough estimate, linear scaling may work, although 20 x 
(1,000,000 / 300,000) gives about 67 rather than 60.
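
   The proportional scaling can be sketched in a few lines of Python. 
This is only an illustration: the function name extrapolate_count is 
made up for this example, and it assumes the word's relative frequency 
stays the same in the larger corpus.

```python
def extrapolate_count(count, corpus_size, target_size):
    """Scale an observed count linearly to a corpus of a different size.

    Assumes the relative frequency is identical in both corpora,
    which is only a rough approximation for real text.
    """
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * target_size / corpus_size


# 20 occurrences in 300K words, scaled to 1 million words
estimate = extrapolate_count(20, 300_000, 1_000_000)
print(round(estimate, 1))  # 66.7
```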

   Nevertheless, because of Zipf's law, when you go from 300K to 1M words 
you pick up many previously unseen words with very low frequencies, and 
these modify the probability distribution.

   For this and other reasons, relative frequencies turn out to be less 
stable than simple proportional scaling suggests when you move to larger 
corpora.
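
   A small simulation may make the point about unseen words concrete. 
It is a sketch under strong assumptions, not a model of any real corpus: 
tokens are drawn independently from an idealised Zipfian distribution 
(probability proportional to 1/rank), the vocabulary size of 50,000 is 
arbitrary, and the sample sizes are scaled down to 30K and 100K tokens 
so that it runs quickly. The function name zipf_sample is hypothetical.

```python
import random


def zipf_sample(n_tokens, vocab_size=50_000, seed=0):
    """Draw n_tokens word ranks with probability proportional to 1/rank.

    vocab_size and seed are arbitrary choices for this illustration.
    """
    rng = random.Random(seed)
    ranks = range(1, vocab_size + 1)
    weights = [1.0 / r for r in ranks]
    return rng.choices(ranks, weights=weights, k=n_tokens)


tokens = zipf_sample(100_000)
small = tokens[:30_000]          # stands in for the smaller corpus

types_small = len(set(small))    # distinct words seen in the smaller sample
types_large = len(set(tokens))   # distinct words seen in the full sample
new_types = types_large - types_small

print("types in smaller sample:", types_small)
print("types in larger sample: ", types_large)
print("previously unseen types:", new_types)
```

   Real text is not an independent random sample (words cluster within 
documents), so actual corpora can deviate further than this sketch 
suggests.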

   You can find out more about it in:
Baroni M., Evert S., "Words and echoes: assessing and mitigating the 
non-randomness problem in word frequency distribution modeling". 
In: Proceedings of ACL 2007 (Association for Computational Linguistics), 
Prague, 23rd-30th June 2007. East Stroudsburg PA: ACL, 2007. p. 904-911.


*Lluís Padró*
Despatx ?-S112
Campus Nord UPC
C/ Jordi Girona 1-3
08034 Barcelona, Spain
Tel: +34 934 134 015
Fax: +34 934 137 833
padro at lsi.upc.edu <mailto:padro at lsi.upc.es>
www.lsi.upc.edu/~padro <http://www.lsi.upc.es/%7Epadro>
Dept. Llenguatges i Sistemes Informàtics <http://www.lsi.upc.es>
TALP <http://www.talp.upc.es> Research Center
