[Corpora-List] calculation problem

Juan Huerta huerta at us.ibm.com
Thu Oct 20 15:50:00 UTC 2005


The answer is correct, but I'd like to offer a slightly different 
explanation:

 
The maximum likelihood estimate of the occurrence rate of that word 
in corpus 1 is rate_0 = 500/5,000,000 = 0.0001

Assuming that the distribution of words and expressions is similar in 
both corpora, the maximum likelihood estimate of the number of 
occurrences of that word in corpus 2 is rate_0 * 1,000,000 = 100


This holds regardless of any particular word distribution assumption. 
The only condition is that corpus 1 (the 5 million words) and corpus 2 
(the 1 million words) follow the same distribution (i.e., they are more 
or less of the same nature).
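
The arithmetic above can be written out as a small Python sketch. The
function name expected_count and its parameter names are made up for
illustration; the numbers are the ones from the question, and the
calculation assumes, as stated above, that both corpora follow the same
distribution.

```python
def expected_count(observed: int, source_size: int, target_size: int) -> float:
    """Scale an observed count to a corpus of a different size.

    Assumes both corpora share the same underlying occurrence rate.
    """
    # Relative frequency (occurrences per word) in the source corpus.
    rate = observed / source_size
    # Expected number of occurrences in the target corpus.
    return rate * target_size


# Occurrence rate in corpus 1: 500 occurrences in 5 million words.
rate_0 = 500 / 5_000_000

# Expected occurrences in corpus 2 (1 million words).
count_corpus_2 = expected_count(500, 5_000_000, 1_000_000)

print(f"rate_0 = {rate_0:.6f}")
print(f"expected count in corpus 2 = {count_corpus_2:.1f}")
```

The same function works for any pair of corpus sizes, as long as the
same-distribution assumption is reasonable for the two corpora.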

-Juan




Sent by:        owner-corpora at lists.uib.no
To:     CORPORA at UIB.NO
cc:      
Subject:        Re: [Corpora-List] calculation problem


Hello Helene,

if you assume that occurrences in your corpus are distributed uniformly
(actually the simplest probability distribution ever), you can use the
figure of 100.

Otherwise, if you use another distribution that better describes the
behaviour of the occurrences, it will affect the number of occurrences
expected in the 1 million word corpus, which will probably not be 100.
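
To give a rough sense of how far an observed count might stray from 100,
here is a minimal Python sketch under a simple binomial model. The
binomial model is an assumption made only for illustration and is not
proposed anywhere in this thread: each of the 1 million words is treated
as an independent trial with the rate estimated from the 5 million word
corpus. Real text is burstier than this, so the true spread is likely
larger.

```python
import math

# Assumption for illustration: each word position is an independent
# trial that matches the expression with probability p.
n = 1_000_000          # size of the smaller corpus, in words
p = 500 / 5_000_000    # rate estimated from the larger corpus

# Mean and standard deviation of a binomial(n, p) count.
mean = n * p
std = math.sqrt(n * p * (1 - p))

print(f"expected count: {mean:.1f}")
print(f"standard deviation: {std:.2f}")
```

Under this model the expected count is still 100, but counts roughly 10
above or below that would be unremarkable.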

Cheers,

Alexander

> --- Original Message ---
> From: "STENGERS, Helene" <Helene.Stengers at ehb.be>
> To: CORPORA at UIB.NO
> Subject: [Corpora-List] calculation problem
> Date: Wed, 19 Oct 2005 14:14:55 +0200 (Romance (summer time))
> 
> 
> 
> 
> Hello dear list members,
> 
> 
> I have an arithmetic question. If a particular expression occurs, let's
> say, 500 times in a 5 million word corpus, can I assume that there will
> be 100 of these expressions in a one million word corpus, or is there a
> statistical (probability) formula that I should apply?
> 
> Cheers,
> 
> Helene Stengers
> 
> 


