Corpora: Corpus size

Marco Antonio Esteves da Rocha marcor at cce.ufsc.br
Mon Jun 11 18:32:25 UTC 2001


On Sun, 3 Jun 2001, Norbert Schlueter wrote:

> Dear all,
>
> size, i.e. number of words, is obviously not the only factor when
> compiling a corpus for special investigations. Far more important
> seems to be to get at least 400 cases of whatever you are looking for.
> It can be shown that even in the worst case of a balanced distribution
> when looking at a variable with two values [e.g. ASPECT:
> progressive/non-progressive --> 50%/50%] the results will be
> significant at the alpha=0.05 level (n = (4*p*(1-p))/alpha^2). I
> wonder if anyone has done some work on this and can comment on the
> number of necessary cases if the variable has got more than two values
> (e.g. SUBJECT: 1PSG, 2PSG, etc.)
>
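The figure of 400 in the quoted message can be checked numerically. The sketch below simply evaluates the formula as quoted; the function name `cases_needed` is made up for illustration. The expression appears to match the usual sample-size formula for estimating a proportion, n = z^2 * p * (1 - p) / e^2, with z rounded to 2 and a margin of 0.05, but that reading is an assumption and is not stated in the message.

```python
def cases_needed(p, alpha=0.05):
    """Evaluate n = 4 * p * (1 - p) / alpha^2, the formula given in the quoted message."""
    return 4 * p * (1 - p) / alpha ** 2


# p * (1 - p) peaks at p = 0.5, so the 50%/50% split is the worst case.
print(round(cases_needed(0.5)))  # 400
print(round(cases_needed(0.1)))  # 144
```

Any less balanced split (p further from 0.5) gives a smaller n under this formula, which is why 400 is presented as the upper bound for a two-valued variable.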

One rule of thumb commonly given in statistics textbooks for
cross-tabulations is:

- aim for a minimum of ten cases per cell
- add ten cases for every four cells

Thus, a 2 x 2 table would require a sample size of 50 cases (4 cells x 10
cases + 10 cases = 50)

A 3 x 2 table: 6 cells x 10 cases + 15 (10 + 5) cases = 75 cases

A 4 x 4 table: 16 cells x 10 cases + 40 cases = 200 cases

A 20 x 10 table: 200 cells x 10 cases + 500 cases = 2500 cases

This assumes you are cross-tabulating two variables.

It is not particularly sophisticated, but it is reliable in most designs.
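The rule of thumb above can be written out as a small function. This is only a sketch: the name `min_sample_size` and its parameters are invented here, and the "ten cases for every four cells" part is applied pro rata, which is what the 3 x 2 example (15 extra cases for 6 cells) suggests.

```python
def min_sample_size(rows, cols, per_cell=10, extra=10, block=4):
    """Rule-of-thumb minimum sample size for a rows x cols cross-tabulation.

    per_cell cases for every cell, plus `extra` cases for every `block`
    cells, applied pro rata (6 cells -> 15 extra cases).
    """
    cells = rows * cols
    return cells * per_cell + cells * extra / block


# The four tables worked through in the text.
for shape in [(2, 2), (3, 2), (4, 4), (20, 10)]:
    print(shape, min_sample_size(*shape))
```

The four calls give 50, 75, 200 and 2500 cases, matching the figures above.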

What I find somewhat risky is relying on sample size alone to reach
significance. Of course there is plenty of debate about that in the
literature, but running an association test definitely improves the
reliability of results concerning relationships between two variables.

The SUBJECT design above might be dealt with by cross-tabulating SUBJECT
by NON-SUBJECT, with 1PSG, 2PSG, etc., as the categories classifying cases
in each of those two variables. If I understood the idea correctly, the
required sample size would be:

12 cells (3 persons x singular/plural x SUBJ/NONSUBJ) x 10 cases + 30
cases (3 groups of four cells x 10) = 150 cases

That is rather more economical than 400 cases. It might yield
significance results which are not so strongly influenced by sheer sample
size, and this may possibly be more reliable, although I am not so sure
about that. I would still prefer to check association with tau or some
other association measure considered more appropriate.


Marco Rocha
marcor at cce.ufsc.br


