<br>

<br><font size=2 face="sans-serif">The answer is correct, but I'd like

to offer a slightly different explanation:</font>

<br>

<br><font size=2 face="sans-serif"> </font>

<br><font size=2 face="sans-serif">The maximum likelihood estimate of

the relative frequency of that word in corpus 1 is rate_0 = 500/5,000,000 = 0.0001</font>

<br>

<br><font size=2 face="sans-serif">Assuming that the distribution of

words and expressions is similar in both corpora,</font>

<br><font size=2 face="sans-serif">the maximum likelihood estimate of

the number of occurrences of that word in corpus 2 is rate_0 * 1,000,000

= 100</font>

<br>

<br>

<br><font size=2 face="sans-serif">This holds regardless of any particular word

distribution assumption. The only condition is that</font>

<br><font size=2 face="sans-serif">corpus 1 (the 5 million words) and

corpus 2 (the 1 million words) follow the same distribution (i.e.,</font>

<br><font size=2 face="sans-serif">they are more or less of the same nature).</font>
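<br>

<br><font size=2 face="sans-serif">As a minimal sketch of the arithmetic above (the function name estimate_count and its parameter names are made up for illustration, and the only assumption is that both corpora share the same underlying rate):</font>

```python
from fractions import Fraction


def estimate_count(count_1, size_1, size_2):
    """Scale a count observed in corpus 1 to a corpus of size_2 words.

    Hypothetical helper: it assumes both corpora follow the same distribution.
    """
    # Maximum likelihood estimate of the relative frequency in corpus 1.
    rate_0 = Fraction(count_1, size_1)
    # Estimated number of occurrences in corpus 2 under the same rate.
    return rate_0 * size_2


# Figures from the question: 500 hits in 5 million words, scaled to 1 million.
rate_0 = Fraction(500, 5_000_000)
expected = estimate_count(500, 5_000_000, 1_000_000)
print(f"rate_0 = {rate_0} ({float(rate_0)})")
print(f"estimated count in corpus 2 = {expected}")
```

<br><font size=2 face="sans-serif">Exact fractions are used here only to keep rounding out of the picture; ordinary division would serve just as well for a quick estimate.</font>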

<br>

<br><font size=2 face="sans-serif">-Juan</font>

<br>

<br>

<br>

<br>

<p><font size=1 color=#800080 face="sans-serif">Sent by:    

   owner-corpora@lists.uib.no</font>

<p><font size=1 color=#800080 face="sans-serif">To:      

 </font><font size=1 face="sans-serif">CORPORA@UIB.NO</font>

<br><font size=1 color=#800080 face="sans-serif">cc:      

  </font>

<br><font size=1 color=#800080 face="sans-serif">Subject:    

   </font><font size=1 face="sans-serif">Re: [Corpora-List]

calculation problem</font>

<br>

<br>

<br><font size=2><tt>Hello Helene,<br>

<br>

If you assume that occurrences in your corpus are distributed uniformly<br>

(the simplest probability distribution there is), you can use this figure<br>

of 100.<br>

<br>

Otherwise, if you use another distribution that better describes the behaviour<br>

of the occurrences, it will affect the number of occurrences in the 1<br>

million word corpus, which will probably not be 100.<br>

<br>

Cheers,<br>

<br>

Alexander<br>

<br>

> --- Original Message ---<br>

> From: "STENGERS, Helene" &lt;Helene.Stengers@ehb.be&gt;<br>

> To: CORPORA@UIB.NO<br>

> Subject: [Corpora-List] calculation problem<br>

> Date: Wed, 19 Oct 2005 14:14:55 +0200 (Romance (summer time))<br>

> <br>

> <br>

>  <br>

>  <br>

> Hello dear list members,<br>

>  <br>

>  <br>

> I have an arithmetic question. If a particular expression occurs let's<br>

> say 500 times in a 5 million word corpus, can I assume that there will<br>

> be 100 of these expressions in a one million corpus or is there a<br>

> statistical (probability) formula which I should apply?<br>

>  <br>

> Cheers,<br>

>  <br>

> Helene Stengers<br>

> <br>

> <br>

<br>


<br>

</tt></font>

<br>