Corpora: minimum size of corpus?

Robert Luk (COMP staff) csrluk at comp.polyu.edu.hk
Fri Feb 11 01:35:59 UTC 2000


> 	This is a very interesting thread. I'd like to ask the List another
> question related to it (three questions, in fact).
>
> 	Let's suppose we have a large corpus of Greek text (or any
> text from a non-expansible corpus), and we want to do a grammatical
> analysis of a part of it for a study of a grammatical category (such as
> case, mood, or number) from the syntactic point of view. For the
> analysis we'll use a computer editor that helps the human linguist tag
> the text in every imaginable way. The analyst does a complete
> morphological and semantic description of every word of the text, a
> skeleton parse of every sentence, tags every syntagm with its function,
> and adds further information about anaphoric relations, and so on. This
> corpus is homogeneous: I mean it is written by a single author in a
> given period of his life, without radical departures from the main
> narrative, either in style or in subject.

> 	Now the (first) question: what is the minimum percentage of
> such a corpus we must analyse in order to confidently extrapolate the
> results of our analysis to the whole corpus? I bet statisticians have an
> (approximate) answer for that. Bibliography? I also understand that it
> is probably methodologically preferable to analyse several portions of
> the same size from the text, rather than parsing a single longer chunk
> of continuous text. And the third question: for such a project, what
> would be the minimum size of the analysed corpus? Any help welcome.

I am not a statistician. However, my view is that the size (even for a
homogeneous corpus) depends on the outcome unit. For example, the corpus
sizes needed to estimate the probability distribution of word occurrences
and of sentence-structure occurrences are quite different: the former
uses the word as the counting unit, while the latter uses the sentence.
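As a rough illustration of this point (the corpus figures and the
independence assumption are invented, not from the thread), here is a
small Python sketch comparing the standard error of an estimated
proportion when the counting unit is the word versus the sentence:

    import math

    def standard_error(p, n):
        # Standard error of an estimated proportion p based on n
        # counting units, assuming (simplistically) that the units
        # are independent draws.
        return math.sqrt(p * (1.0 - p) / n)

    # Hypothetical corpus: 100,000 running words forming 5,000 sentences.
    words, sentences = 100000, 5000

    # A word-level phenomenon occurring 1% of the time:
    print(standard_error(0.01, words))      # ~0.0003

    # A sentence-level phenomenon occurring 1% of the time: the same
    # text yields twenty times fewer counting units, so the estimate
    # is considerably less precise.
    print(standard_error(0.01, sentences))  # ~0.0014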

Because the minimum size depends on the outcome unit and on what we want
to infer (e.g. a distribution, probabilities, etc.), the approximate
answers and the techniques for deriving them will also differ. As others
have pointed out, the general rule is to use as large a corpus as
possible, so that we can always get enough data for most, if not all, of
the investigation. If we are restricted in data size, we may instead need
to work out which analyses can be done and what confidence level can be
reached for a given inference.
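To make that last point concrete, the textbook sample-size calculation
for estimating a proportion, n = z^2 * p * (1 - p) / e^2, can be run in
either direction. The numbers below are purely illustrative:

    import math

    def min_sample_size(p, margin, z=1.96):
        # Units needed to estimate a proportion near p to within
        # +/- margin, at the confidence level implied by z
        # (z = 1.96 for 95%). Assumes independent counting units.
        return int(math.ceil(z * z * p * (1.0 - p) / (margin * margin)))

    def achievable_margin(p, n, z=1.96):
        # The inverse question: with only n units available, what
        # margin of error should we expect around an estimate near p?
        return z * math.sqrt(p * (1.0 - p) / n)

    # Tagged units needed to pin a 5% phenomenon down to +/- 1%:
    print(min_sample_size(0.05, 0.01))    # 1825

    # With only 500 analysed sentences, the best we can say about a
    # structure occurring in 5% of them is roughly +/- 1.9 points:
    print(achievable_margin(0.05, 500))   # ~0.019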

I remember a book called Sampling (Techniques?), published by John
Wiley. It covers many sampling techniques (such as bootstrapping) and
discusses how the sample size is determined for a given level of
confidence. I also remember that the required sample size depends on the
value of the proportion itself, and that an inversion technique is used
for small proportions. I am not sure whether the Handbook of Statistics
has anything relevant.
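For what it is worth, here is a minimal sketch of the percentile
bootstrap for the proportion case (the counts are invented, and the book
in question would treat this more carefully):

    import random

    def bootstrap_ci(successes, n, level=0.95, reps=10000, seed=0):
        # Percentile bootstrap for a proportion: resample n binary
        # units with replacement from the observed sample and read
        # the confidence limits off the resampled proportions.
        rng = random.Random(seed)
        p_hat = float(successes) / n
        props = sorted(sum(rng.random() < p_hat for _ in range(n)) / n
                       for _ in range(reps))
        lo = props[int(reps * (1 - level) / 2)]
        hi = props[int(reps * (1 - (1 - level) / 2)) - 1]
        return lo, hi

    # E.g. 12 occurrences of some structure in 400 tagged sentences:
    print(bootstrap_ci(12, 400))  # roughly (0.013, 0.048)

The inversion technique mentioned above is, I would guess, something
like the exact (Clopper-Pearson) interval, which inverts the binomial
test rather than relying on a normal approximation and so behaves better
for small proportions.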

Regards,

Robert Luk


