[Corpora-List] calculation problem
Dragomir Radev
radev at umich.edu
Thu Oct 20 16:43:13 UTC 2005
Here is the basic idea: let p_i be a parameter of your model which
tells you how often the word w_i appears in the underlying
distribution. The likelihood of your observation, P(data|p_i), namely
500 occurrences in 5 million words, is then a function of p_i.
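
To make "the likelihood is a function of p_i" concrete, here is a
minimal Python sketch. It assumes a binomial model (every token is an
independent draw, which real text only approximates), and the function
name and the list of candidate values are made up for illustration.

```python
import math

def log_likelihood(p, k=500, n=5_000_000):
    """Binomial log-likelihood of k hits in n tokens, as a function of p.

    The binomial coefficient is left out because it does not depend on p.
    Assumption: tokens are independent draws with success probability p.
    """
    return k * math.log(p) + (n - k) * math.log1p(-p)

# Hypothetical candidate values for p_i, chosen only for illustration.
candidates = [5e-5, 8e-5, 1e-4, 1.2e-4, 2e-4]
for p in candidates:
    print(f"p = {p:.1e}   log-likelihood = {log_likelihood(p):.2f}")

# The candidate under which the observed data are most probable.
best = max(candidates, key=log_likelihood)
print("best candidate:", best)
```

Other candidates are not ruled out by the data; they are just less
likely, and that spread is what the rest of the argument builds on.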
Different values of p_i could have generated the data that you
observed. You need to compute the probability of the data under every
possible value of p_i and combine it with a prior on p_i. You will
thereby obtain a probability distribution p(p_i|data) for p_i over the
interval [0..1]. To get the distribution of occurrences p' of w_i in
the new corpus, you will have to integrate p(p'|p_i) p(p_i|data) over
p_i from 0 to 1.
In the case of a multinomial distribution with a uniform prior over
[0..1], the most probable value of p_i, equal to 500/5000000=0.0001,
coincides with the maximum likelihood estimate p_i_ML of p_i.
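
As a rough sketch of how the integration works out for the numbers in
the question: assuming a binomial model for the single expression and
a uniform Beta(1,1) prior (my assumptions for this illustration), the
distribution for p_i is a Beta and the count in the new corpus follows
a beta-binomial, whose mean and standard deviation have closed forms.
The function names below are hypothetical.

```python
import math

def posterior_params(k, n, prior_a=1.0, prior_b=1.0):
    """Beta posterior for p_i after k hits in n tokens.

    Assumption: Beta(prior_a, prior_b) prior; (1, 1) is uniform on [0, 1].
    """
    return prior_a + k, prior_b + (n - k)

def predictive_mean_sd(m, a, b):
    """Mean and standard deviation of the beta-binomial count in m tokens."""
    mean = m * a / (a + b)
    var = m * a * b * (a + b + m) / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

k, n = 500, 5_000_000   # observed: 500 hits in a 5 million word corpus
m = 1_000_000           # size of the new corpus

p_ml = k / n                        # maximum likelihood estimate
a, b = posterior_params(k, n)       # distribution for p_i given the data
mean, sd = predictive_mean_sd(m, a, b)

print(f"p_ML = {p_ml}")
print(f"expected count in new corpus = {mean:.1f}, sd = {sd:.1f}")
```

Under these assumptions the expected count is close to 100, with a
standard deviation of about 11, so counts some way above or below 100
would not be surprising. Real corpora tend to vary more than this
model allows, since words cluster by text and topic.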
D.
STENGERS, Helene wrote:
> Hello dear list members,
>
>
> I have an arithmetic question. If a particular expression occurs,
> let's say, 500 times in a 5 million word corpus, can I assume that
> there will be 100 of these expressions in a one million word corpus,
> or is there a statistical (probability) formula which I should apply?
>
> Cheers,
>
> Helene Stengers
--
Dragomir R. Radev radev at umich.edu
Associate Professor of Information, Electrical Engineering and
Computer Science, and Linguistics, the University of Michigan, Ann Arbor
Phone: 734-615-5225 Fax: 734-764-2475 http://www.si.umich.edu/~radev