[Corpora-List] extrapolating to 1 million
Lluís Padró
padro at lsi.upc.edu
Fri May 15 13:39:26 UTC 2009
Tina Waldman wrote:
> Dear members
> Could you tell me what the frequency would be in a corpus of 1 million
> if I extrapolated from the frequency of 20 in a corpus of 300K?
>
> Would it be 60 - 20 x 3 ?
>
As a rough estimate, that may work.
Nevertheless, due to Zipf's law, when you go from 300K to 1M you get
lots of previously unseen words with very low frequencies, and these
change the probability distribution.
For this and other reasons, relative frequencies turn out to be less
stable across corpus sizes than simple linear scaling suggests.
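
To make the two points above concrete, here is a minimal Python sketch. It is not taken from the paper cited in this message. The helper names (extrapolate, zipf_sample), the toy vocabulary size and the scaled-down corpus sizes (30K and 100K tokens instead of 300K and 1M) are all assumptions chosen for illustration. The first function does the plain linear scaling; note that 1M / 300K is about 3.33, not 3. The second part draws tokens from an idealised Zipfian distribution to show how many new word types turn up as the sample grows.

```python
import random
from itertools import accumulate


def extrapolate(count, corpus_size, target_size):
    """Linear scaling: assumes the relative frequency stays constant."""
    return count * target_size / corpus_size


def zipf_sample(n_tokens, vocab_size, rng):
    """Draw n_tokens word ranks with probability proportional to 1/rank.

    This is an idealised Zipfian model (an assumption for illustration),
    with every token drawn independently of the others.
    """
    ranks = range(1, vocab_size + 1)
    cum_weights = list(accumulate(1.0 / r for r in ranks))
    return rng.choices(ranks, cum_weights=cum_weights, k=n_tokens)


# Frequency of 20 in 300K tokens, scaled linearly to 1M tokens.
print(extrapolate(20, 300_000, 1_000_000))  # roughly 66.7 rather than 60

# Toy simulation: hypothetical 50K-word vocabulary, 100K-token corpus,
# of which the first 30K tokens play the role of the smaller corpus.
rng = random.Random(0)
tokens = zipf_sample(100_000, 50_000, rng)
types_small = len(set(tokens[:30_000]))
types_large = len(set(tokens))

# The larger sample contains word types never seen in the smaller one.
print("types in first 30K tokens:", types_small)
print("types in all 100K tokens: ", types_large)
print("new types:", types_large - types_small)
```

Because the simulated tokens are drawn independently, this only illustrates the vocabulary growth; real text is not a random sample of words, which is a further source of instability in extrapolated counts.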
You can find out more about it in:
Baroni M., Evert S. (2007). "Words and echoes: assessing and mitigating the
non-randomness problem in word frequency distribution modeling".
In: Proceedings of ACL 2007 (Prague, 23rd-30th June 2007),
East Stroudsburg PA: ACL, pp. 904-911.
best,
--
------------------------------------------------------------------------
*Lluís Padró*
Despatx ?-S112
Campus Nord UPC
C/ Jordi Girona 1-3
08034 Barcelona, Spain Tel: +34 934 134 015
Fax: +34 934 137 833
padro at lsi.upc.edu <mailto:padro at lsi.upc.es>
www.lsi.upc.edu/~padro <http://www.lsi.upc.es/%7Epadro>
------------------------------------------------------------------------
UNIVERSITAT POLITÈCNICA DE CATALUNYA
Dept. Llenguatges i Sistemes Informàtics <http://www.lsi.upc.es>
TALP <http://www.talp.upc.es> Research Center
------------------------------------------------------------------------
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora