Corpora: Corpus size

Norbert Schlueter nosch at
Sun Jun 3 12:32:15 UTC 2001

Dear all,

Size, i.e. the number of words, is obviously not the only factor when
compiling a corpus for a specific investigation. Far more important,
it seems, is to get at least 400 cases of whatever you are looking for.
It can be shown that even in the worst case of a balanced distribution
for a variable with two values [e.g. ASPECT:
progressive/non-progressive --> 50%/50%], 400 cases are enough to
estimate each proportion to within about +/-5 percentage points at the
95% confidence level, i.e. alpha=0.05 (n = (4*p*(1-p))/e^2, with
margin of error e = 0.05 and the factor 4 approximating 1.96^2). I
wonder if anyone has done some work on this and can comment on the
number of cases needed if the variable has more than two values
(e.g. SUBJECT: 1PSG, 2PSG, etc.).
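
The arithmetic behind the 400 figure can be checked with a short Python
sketch. It assumes that the 0.05 in the formula acts as a margin of
error around the observed proportion and that the factor 4 stands for
z^2 at roughly 95% confidence (z = 2). The function name
required_cases and its parameters are hypothetical, introduced only
for illustration.

```python
import math


def required_cases(p: float, margin: float = 0.05, z: float = 2.0) -> int:
    """Cases needed to estimate a proportion p to within +/- margin.

    Implements n = z^2 * p * (1 - p) / margin^2. With z = 2 this is the
    4*p*(1-p)/0.05^2 expression from the message (assumed reading).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    if margin <= 0.0:
        raise ValueError("margin must be positive")
    n = (z ** 2) * p * (1.0 - p) / margin ** 2
    # Round away floating-point noise before taking the ceiling.
    return math.ceil(round(n, 9))


if __name__ == "__main__":
    # The balanced 50%/50% split needs the most cases; skewed splits need fewer.
    for p in (0.5, 0.3, 0.1):
        print(f"p = {p:.1f}: n = {required_cases(p)}")
```

For the balanced 50%/50% case this gives 400, and any more skewed split
(e.g. 10%/90%, which gives 144) needs fewer cases, which is why 50%/50%
is the worst case. The sketch covers only the two-valued situation and
says nothing about variables with more than two values.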

Best, Norbert

Norbert Schlüter
English Language Pedagogy
Freie Universität Berlin
nosch at
