[Corpora-List] calculation problem

Marco Baroni baroni at sslmit.unibo.it
Thu Oct 20 14:07:14 UTC 2005


Dear Helene,

Your figure of 100 is based on a reasonable "point" estimate of the "true" 
proportion of occurrences of the expression in the population (known as the 
maximum likelihood estimate), i.e. the observed proportion 500/5000000.
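
As a minimal sketch of that arithmetic in R (the variable names k, N, n_small, 
p_hat and expected are only illustrative, not part of any package):

```r
k <- 500            # occurrences observed in the larger corpus
N <- 5000000        # size of the larger corpus, in words
n_small <- 1000000  # size of the smaller corpus, in words

p_hat <- k / N               # maximum likelihood estimate of the proportion
expected <- p_hat * n_small  # point estimate of the count in the smaller corpus

print(p_hat)
print(expected)
```

This is simply the rule of three; on its own it says nothing about how far the 
actual count may plausibly deviate from the point estimate.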

You could also obtain a range of "plausible" values for the estimate by 
running a binomial test (available in most statistical packages) with 
parameters 500 for k (successes) and 5000000 for N (trials). You would then 
get a confidence interval (typically, by default, the 95% confidence 
interval)  for the plausible values that the proportion can have in the 
population.

Multiplying these proportions by 1M would give you a range of plausible 
frequencies of occurrences in the smaller corpus.

Concretely, using the statistical package R (http://www.r-project.org/):

 > binom.test(500,5000000)

	Exact binomial test

data:  500 and 5e+06
number of successes = 500, number of trials = 5e+06, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
   (<- ignore this line and the p-value above: they refer to the default
   null hypothesis of p = 0.5, which is not of interest here)
95 percent confidence interval:
  0.0000914261 0.0001091614
sample estimates:
probability of success
                  1e-04

 > 0.0000914261*1000000
[1] 91.4261

 > 0.0001091614*1000000
[1] 109.1614

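Thus, you could say that you are 95% confident that the expected frequency 
in the smaller corpus lies between approximately 91 and 109. (The count 
actually observed in a given 1M-word sample will vary somewhat more than 
this, since it is itself subject to sampling variation.)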
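
The same steps can also be run as a short script, rather than copying the 
bounds by hand. This is only a sketch: the variable names (k, N, n_small, ci, 
range_small) are illustrative, and it relies on the conf.int component of the 
object that binom.test returns.

```r
k <- 500            # occurrences observed in the larger corpus
N <- 5000000        # size of the larger corpus, in words
n_small <- 1000000  # size of the smaller corpus, in words

test <- binom.test(k, N)   # exact binomial test; 95% interval by default
ci <- as.numeric(test$conf.int)  # lower and upper bound of the proportion

range_small <- ci * n_small      # rescale the bounds to the smaller corpus
print(round(range_small, 1))
```

A different confidence level can be requested through the conf.level argument 
of binom.test, e.g. binom.test(k, N, conf.level = 0.99).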

Of course, in all cases you have to assume that the two corpora can be seen 
as random samples from the same population. This is almost never strictly 
the case, but the assumption can be violated more or less seriously.

Hth,

Marco

STENGERS, Helene wrote:
>  
>  
> Hello dear list members,
>  
>  
> I have an arithmetic question. If a particular expression occurs let's
> say 500 times in a 5 million word corpus, can I assume that there will
> be 100 of these expressions in a one-million-word corpus, or is there a
> statistical (probability) formula which I should apply?
>  
> Cheers,
>  
> Helene Stengers
> 
> 

-- 
Marco Baroni
SSLMIT, University of Bologna
http://sslmit.unibo.it/~baroni


