Corpora: minimum size of corpus?

Thu Feb 10 14:36:16 UTC 2000

	This is a very interesting thread. I'd like to ask the List another
question related to it (three questions, in fact).

	Let's suppose we have a large corpus of Greek text (or any closed
corpus that cannot be expanded), and we want to analyse part of it
grammatically for a study of a grammatical category (case, mood, number,
etc.) from the syntactic point of view. For the analysis we'll use a
computer editor that helps the human linguist tag the text in every
imaginable way. The analyst produces a complete morphological and semantic
description of every word of the text and a skeleton parse of every
sentence, tags every syntagm with its function, and adds further
information about anaphoric relations, etc. This corpus is homogeneous: I
mean it is written by a single author in a given period of his life,
without radical departures from the main narrative in either style or
subject. Now the first question: what is the minimum percentage of such a
corpus that we must analyse in order to extrapolate the results of our
analysis confidently to the whole corpus? I bet statisticians have an
(approximate) answer for that. Bibliography? The second question: I
understand that it is probably methodologically preferable to analyse
several portions of the same size from the text, instead of parsing only
one longer chunk of continuous text. Is that right? And the third
question: for such a project, what would be the minimum size of the
analysed corpus? Any help is welcome.
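
To make the first question concrete, here is a minimal sketch of the kind
of calculation I imagine a statistician might start from. It assumes (and
this is a strong assumption, since running text is not a random sample)
that the feature under study can be treated as a proportion estimated from
a simple random sample of units such as sentences, and it applies the
textbook sample-size formula for a proportion with the finite population
correction. The function name, the default margin of error and the example
corpus size are all hypothetical, chosen only for illustration.

```python
import math


def sample_size(population, margin=0.05, z=1.96, p=0.5):
    """Units to analyse to estimate a proportion within +/- margin.

    population: total number of units (e.g. sentences) in the closed corpus
    margin:     acceptable margin of error (0.05 = five percentage points)
    z:          normal quantile for the confidence level (1.96 ~ 95%)
    p:          expected proportion; 0.5 is the most conservative guess
    """
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction: a closed corpus needs fewer units.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    corpus_sentences = 10_000  # hypothetical corpus size
    needed = sample_size(corpus_sentences)
    print(needed, "sentences, i.e.",
          round(100 * needed / corpus_sentences, 1), "% of the corpus")
```

Under these assumptions the required percentage falls as the corpus grows,
which suggests that an absolute number of units may be a more useful target
than a fixed percentage. It says nothing about rare constructions, for
which a much larger sample would presumably be needed.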

~~~~~~~~~~~~~~~~~~~
Daniel Riaño Rufilanchas
Madrid, España

Please note my new e-mail address: danielrr at retemail.es