[Corpora-List] calculation problem
Juan Huerta
huerta at us.ibm.com
Thu Oct 20 15:50:00 UTC 2005
The answer is correct, but I'd like to offer a slightly different
explanation:
The maximum likelihood estimate of the occurrence rate of that word
in corpus 1 is rate_0 = 500/5,000,000 = 0.0001.
Assuming that the distribution of words and expressions is similar in
both corpora,
the maximum likelihood estimate of the number of occurrences of that
word in corpus 2 is rate_0 * 1,000,000 = 100.
This holds regardless of any particular word distribution assumption. The only
condition is that
corpus 1 (the 5 million words) and corpus 2 (the 1 million words) follow the
same distribution (i.e.,
they are more or less of the same nature).
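
As a minimal sketch of the arithmetic above (the function name expected_count and its parameter names are only illustrative, and the two corpora are assumed to follow the same distribution):

```python
from fractions import Fraction


def expected_count(count_1: int, size_1: int, size_2: int) -> Fraction:
    """Scale a count observed in corpus 1 to a corpus of size_2 words.

    Assumes both corpora follow the same distribution.
    """
    # Maximum likelihood estimate of the per-word occurrence rate in corpus 1.
    rate_0 = Fraction(count_1, size_1)
    # Estimated number of occurrences in corpus 2.
    return rate_0 * size_2


estimate = expected_count(500, 5_000_000, 1_000_000)
print(estimate)  # 100
```

Exact fractions are used here only to avoid floating-point rounding in the example; ordinary division would do just as well.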
-Juan
Sent by: owner-corpora at lists.uib.no
To: CORPORA at UIB.NO
cc:
Subject: Re: [Corpora-List] calculation problem
Hello Helene,
if you assume that occurrences in your corpus are distributed uniformly
(actually the simplest probability distribution there is), you can take this
number, 100.
Otherwise, if you use another distribution that better describes the behaviour
of the occurrences, it will influence the number of occurrences in the 1
million word corpus, which will probably not be 100.
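
To give a rough feel for how far the observed count might stray from 100, here is a small sketch under one possible model, which is my assumption and not something stated in this thread: a binomial model in which each word position is independently the expression with the same probability p. The variable names are illustrative.

```python
import math

n = 1_000_000          # words in the smaller corpus
p = 500 / 5_000_000    # per-word rate estimated from the larger corpus

# Under the assumed binomial model:
mean = n * p                      # expected number of occurrences
std = math.sqrt(n * p * (1 - p))  # standard deviation of the count

print(round(mean), round(std, 2))
```

Under this assumption the expected count is still 100, with a standard deviation of about 10. Real texts tend to be burstier than an independence model allows, so the actual spread may well be larger.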
Cheers,
Alexander
> --- Original Message ---
> From: "STENGERS, Helene" <Helene.Stengers at ehb.be>
> To: CORPORA at UIB.NO
> Subject: [Corpora-List] calculation problem
> Date: Wed, 19 Oct 2005 14:14:55 +0200 (Romance (summer time))
>
>
>
>
> Hello dear list members,
>
>
> I have an arithmetic question. If a particular expression occurs, let's
> say, 500 times in a 5 million word corpus, can I assume that there will
> be 100 of these expressions in a one million word corpus, or is there a
> statistical (probability) formula which I should apply?
>
> Cheers,
>
> Helene Stengers
>
>