[Corpora-List] calculation problem

Dragomir Radev radev at umich.edu
Thu Oct 20 16:43:13 UTC 2005


Here is the basic idea: let p_i be a parameter of your model which
tells you how often (i.e., with what probability) the word w_i appears
in the underlying distribution. The likelihood of your observation
P(data|p_i), namely 500 times out of 5 million, is then a function of p_i.

Different values of p_i could have generated the data that you
observed. You need to compute the probability of the data given each
possible value of p_i. Combined with a prior, this gives you a
probability distribution for p_i over the interval [0..1]. To get the
distribution of the number of occurrences p' of w_i in the new corpus,
you will have to integrate p(p'|p_i) p(p_i|data) over p_i from 0 to 1.

In the case of a multinomial distribution with a uniform prior over
[0..1], one particular value of p_i, equal to 500/5000000=0.0001, will
end up being the maximum likelihood estimate p_i_ML of p_i.
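
To make this concrete, here is a minimal sketch in Python. It assumes
the simplest version of the model: a single word treated as binomial,
with a uniform Beta(1,1) prior on p_i. Under that assumption the
posterior is Beta(k+1, n-k+1) and the integral over p_i has a closed
form, the beta-binomial distribution. The function and variable names
are illustrative only and are not taken from any particular library.

```python
import math

def log_beta(a, b):
    # log of the Beta function, via log-gamma to avoid overflow
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(x, m, a, b):
    # P(x occurrences in m new words), with p_i integrated out
    # against a Beta(a, b) posterior
    log_choose = (math.lgamma(m + 1) - math.lgamma(x + 1)
                  - math.lgamma(m - x + 1))
    return math.exp(log_choose + log_beta(x + a, m - x + b) - log_beta(a, b))

k, n = 500, 5_000_000   # observed: 500 occurrences in 5 million words
m = 1_000_000           # size of the new corpus

p_ml = k / n            # maximum likelihood estimate of p_i

# Posterior over p_i under the (assumed) uniform prior
a, b = k + 1, n - k + 1

# Mean and standard deviation of the predicted count in the new corpus
mean = m * a / (a + b)
var = m * a * b * (a + b + m) / ((a + b) ** 2 * (a + b + 1))
sd = math.sqrt(var)

# Probability of seeing exactly 100 occurrences in the new corpus
prob_100 = beta_binomial_pmf(100, m, a, b)

print(f"p_ML = {p_ml}")
print(f"predicted count: mean = {mean:.2f}, sd = {sd:.2f}")
print(f"P(exactly 100) = {prob_100:.4f}")
```

Under these assumptions the predicted count is centred on about 100,
with a standard deviation of roughly 11, so 100 is a sensible point
estimate but counts some distance away from it would not be
surprising. The sketch also ignores burstiness and differences between
corpora, which in practice can make the spread considerably larger.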

D.

STENGERS, Helene wrote:
> Hello dear list members,
>  
>  
> I have an arithmetic question. If a particular expression occurs let's
> say 500 times in a 5 million word corpus, can I assume that there will
> be 100 of these expressions in a one million corpus or is there a
> statistical (probability) formula which I should apply?
>  
> Cheers,
>  
> Helene Stengers


-- 
Dragomir R. Radev                                         radev at umich.edu
Associate Professor of Information, Electrical Engineering and
Computer Science, and Linguistics, the University of Michigan, Ann Arbor
Phone: 734-615-5225   Fax: 734-764-2475    http://www.si.umich.edu/~radev


