[Corpora-List] calculation problem
Marco Baroni
baroni at sslmit.unibo.it
Thu Oct 20 14:07:14 UTC 2005
Dear Helene,
Your choice is based on a reasonable "point" estimate of the "true"
proportion of occurrences of the expression in the population (known as the
maximum likelihood estimate).
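Spelled out with the figures from the question (500 occurrences in 5 million words, scaled to a 1 million word corpus), the point estimate amounts to the following arithmetic. The symbols k, N and \hat{p} are only notation chosen here for illustration:

```latex
\hat{p} = \frac{k}{N} = \frac{500}{5\,000\,000} = 10^{-4},
\qquad
\hat{p} \times 1\,000\,000 = 100
```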
You could also obtain a range of "plausible" values for the estimate by
running a binomial test (available in most statistical packages) with
parameters 500 for k (successes) and 5000000 for N (trials). You would then
get a confidence interval (typically, by default, the 95% confidence
interval) for the plausible values that the proportion can have in the
population.
Multiplying the lower and upper bounds of this interval by 1M would give you a
range of plausible frequencies of occurrence in the smaller corpus.
Concretely, using the statistical package R (http://www.r-project.org/):
> binom.test(500,5000000)
Exact binomial test
data: 500 and 5e+06
number of successes = 500, number of trials = 5e+06, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
   <- ignore this line and the p-value above: both refer to the default
      null hypothesis p = 0.5, which is irrelevant here
95 percent confidence interval:
0.0000914261 0.0001091614
sample estimates:
probability of success
1e-04
> 0.0000914261*1000000
[1] 91.4261
> 0.0001091614*1000000
[1] 109.1614
Thus, you could say with 95% confidence that the expected frequency in the
smaller corpus lies between approximately 91 and 109.
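As a minimal sketch, the same steps can be run in one go, without copying the interval bounds by hand. The variable names (k, n, m, bt, ci, expected_range) are chosen here purely for illustration; binom.test and the conf.int component of its result are part of base R. It should reproduce the bounds from the session above.

```r
# Figures taken from the example in this thread
k <- 500        # occurrences observed in the larger corpus
n <- 5000000    # size of the larger corpus, in words
m <- 1000000    # size of the smaller corpus, in words

# Exact binomial test; only the confidence interval is of interest here,
# not the p-value against the default null hypothesis p = 0.5
bt <- binom.test(k, n)
ci <- as.numeric(bt$conf.int)   # 95% interval for the proportion

# Scale the interval bounds to the size of the smaller corpus
expected_range <- ci * m
print(expected_range)
```

For a different confidence level, binom.test accepts a conf.level argument, e.g. binom.test(k, n, conf.level = 0.99).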
Of course, in all cases you have to assume that the two corpora can be seen
as random samples from the same population, which is almost never the
case, but there can be more or less serious violations of the assumption.
Hth,
Marco
STENGERS, Helene wrote:
>
>
> Hello dear list members,
>
>
> I have an arithmetic question. If a particular expression occurs, let's
> say, 500 times in a 5 million word corpus, can I assume that there will
> be 100 of these expressions in a one million word corpus, or is there a
> statistical (probability) formula which I should apply?
>
> Cheers,
>
> Helene Stengers
>
>
--
Marco Baroni
SSLMIT, University of Bologna
http://sslmit.unibo.it/~baroni