<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
Tina Waldman wrote:
<blockquote cite="mid:000e01c9d55e$5fa4d590$6502a8c0@tinalaptop"
type="cite">
<div><font face="Arial" size="2">Dear members</font></div>
<div><font face="Arial" size="2">Could you tell me what the frequency
would be in a corpus of 1 million if I extrapolated from the frequency
of 20 in a corpus of 300K?</font></div>
<div> </div>
<div><font face="Arial" size="2">Would it be 60 - 20 x 3 ?</font></div>
<div> </div>
</blockquote>
As a rough estimate, linear scaling may work, although 1M / 300K is
about 3.33 rather than 3, so it gives roughly 67 rather than 60.<br>
<br>
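The arithmetic can be sketched in a few lines of Python. This is only an
illustration of plain linear scaling; the function names
<tt>scale_frequency</tt> and <tt>per_million</tt> are made up for this
example and do not come from any library.<br>
<pre>
```python
def scale_frequency(count, corpus_size, target_size):
    """Linearly rescale a raw count to a corpus of a different size."""
    return count * target_size / corpus_size


def per_million(count, corpus_size):
    """Express a raw count as a frequency per million tokens."""
    return scale_frequency(count, corpus_size, 1_000_000)


# 20 occurrences in a 300K-token corpus, scaled to 1M tokens.
estimate = per_million(20, 300_000)
print(round(estimate, 1))  # prints 66.7
```
</pre>
The result assumes the word keeps the same relative frequency in the
larger corpus, which is exactly the assumption questioned below.<br>
<br>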
Nevertheless, because of Zipf's law, when you go from 300K to 1M words
you get many previously unseen words. Each has a very low frequency,
but together they modify the probability distribution.<br>
<br>
For this and other reasons, relative frequencies tend not to stay as
stable as linear scaling assumes when you move to a larger corpus.<br>
<br>
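The growth in unseen words can be illustrated with a small simulation.
This is a toy sketch under strong assumptions: tokens are drawn
independently from an idealised Zipfian distribution (probability
proportional to 1/rank), the vocabulary size of 50,000 is arbitrary, and
the corpus sizes are scaled down to 30K and 100K to keep it fast. Real
text is not a random sample, so this only shows the vocabulary-growth
effect, not the non-randomness problem discussed in the paper cited
below. The names <tt>zipf_sample</tt> and <tt>count_types</tt> are
hypothetical helpers.<br>
<pre>
```python
import random
from itertools import accumulate


def zipf_sample(n_tokens, vocab_size, seed=0):
    """Draw n_tokens word ranks with probability proportional to 1/rank."""
    rng = random.Random(seed)
    weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
    cum_weights = list(accumulate(weights))
    ranks = range(1, vocab_size + 1)
    return rng.choices(ranks, cum_weights=cum_weights, k=n_tokens)


def count_types(tokens):
    """Number of distinct word types in a token sequence."""
    return len(set(tokens))


# The small corpus is a prefix of the large one, mimicking corpus growth.
large = zipf_sample(100_000, 50_000)
small = large[:30_000]

# Types that only show up once the corpus has grown.
new_types = count_types(large) - count_types(small)
print(count_types(small), count_types(large), new_types)
```
</pre>
Under these assumptions the larger sample contains word types that the
smaller one never saw, and those types take up probability mass that a
model estimated from the smaller sample could not account for.<br>
<br>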
You can find out more about it in:<br>
Baroni M., Evert S., "Words and echoes: assessing and mitigating the
non-randomness problem in word frequency distribution modeling".
In: Proceedings of ACL 2007 (Association for Computational Linguistics),
Prague, 23rd-30th June 2007. East Stroudsburg PA: ACL, 2007,
pp. 904-911.<br>
<br>
best,<br>
<br>
<div class="moz-signature">-- <br>
<table>
<tbody>
<tr>
<td colspan="2" align="center">
<hr width="100%"></td>
</tr>
<tr>
<td valign="top"><font color="#0000aa"><b>Lluís Padró</b></font><br>
<font color="#2f2f66">Despatx Ω-S112<br>
Campus Nord UPC<br>
C/ Jordi Girona 1-3<br>
08034 Barcelona, Spain</font></td>
<td valign="top"><font color="#0000aa">Tel: <tt><font size="+1">+34
934 134 015</font></tt><br>
Fax: <tt><font size="+1">+34 934 137 833</font></tt></font><br>
<tt><font size="+1"><a href="mailto:padro@lsi.upc.es">padro@lsi.upc.edu</a><br>
<a href="http://www.lsi.upc.es/%7Epadro" target="_top">www.lsi.upc.edu/~padro</a></font></tt></td>
</tr>
<tr>
<td colspan="2" align="center">
<hr width="100%"><font color="#2f2f66">UNIVERSITAT POLITÈCNICA DE
CATALUNYA<br>
Dept. <a href="http://www.lsi.upc.es" target="_top">Llenguatges i
Sistemes Informàtics</a><br>
<a href="http://www.talp.upc.es" target="_top">TALP</a> Research
Center</font>
<hr width="100%"></td>
</tr>
</tbody>
</table>
</div>
</body>
</html>