Corpora: minimum size of corpus?

Gabriel Pereira Lopes gpl at di.fct.unl.pt
Mon Feb 14 19:10:28 UTC 2000


We used a corpus of approximately 5,000 tagged words to train a neural-net-based
tagger, and the tagging precision we obtained was quite high (94 to 96%, on text
that was rather faulty). Later we used that tagger to tag text from another
collection, which was then hand-corrected (here we used a corpus of a different
style, with approximately 20,000 hand-corrected tagged words), retrained our
tagger, and got 98% precision on well-written text. Using both hand-corrected
corpora to train a new tagger gave worse precision results, but then the texts
were of different genres. From such small corpora a lot can be learned...
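To give a rough sense of what such precision figures mean on test sets of this
size, here is a small illustrative sketch (mine, not taken from the papers cited
below) that computes a normal-approximation 95% confidence interval for a
measured precision; the numbers in the example are merely of the same order as
those above:

    import math

    def precision_interval(correct, total, z=1.96):
        # Normal-approximation confidence interval for a proportion.
        p = correct / total
        half_width = z * math.sqrt(p * (1 - p) / total)
        return p - half_width, p + half_width

    # 96% observed precision on a 5,000-word test set (illustrative figures):
    print(precision_interval(4800, 5000))   # roughly (0.955, 0.965)

Even on a 5,000-word test set the interval is fairly narrow, which is part of
why rather small corpora already support useful conclusions.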

[ML96a] Nuno Marques, José Gabriel Lopes. Using Neural Nets for Portuguese
Part-of-Speech Tagging. In Proceedings of the Fifth International Conference on
the Cognitive Science of Natural Language Processing, Dublin City University,
September 2-4 (9 pages). 1996.
[ML96b] Nuno Marques, José Gabriel Lopes. A Neural Network Approach for
Part-of-Speech Tagging. In Proceedings of the Second Workshop on Spoken and
Written Portuguese Language Processing, pp. 1-9, Curitiba, Brazil, October
21-22. 1996.

These results contrast with those obtained using Hidden Markov Models:

[VMLV95] A. Vilavicencio, N. Marques, G. Lopes, F. Vilavicencio.
Part-of-Speech Tagging for Portuguese Texts. In Jacques Wainer and Ariadne
Carvalho, editors, Advances in Artificial Intelligence: Proceedings of the XII
Brazilian Symposium on Artificial Intelligence, Lecture Notes in Artificial
Intelligence 991, pp. 323-332, Campinas, October 11-13. Springer Verlag. 1995.

These taggers also enabled us to automatically tag a whole collection of
40,000,000 words and to use that collection for extracting verb
subcategorization frames. Evaluating those frames allowed us to identify
patterns of tagging errors and to reach extraction precisions above 90%.

[MLC98a] Nuno Marques, José Gabriel Lopes and Carlos Agra Coelho. Learning
Verbal Transitivity Using LogLinear Models. In Claire Nédellec and Céline
Rouveirol, editors, Proceedings of the 10th European Conference on Machine
Learning, Lecture Notes in Artificial Intelligence 1398, pp. 19-24, Chemnitz,
Germany. Springer Verlag. 1998.
[MLC98b] Nuno Marques, José Gabriel Lopes and Carlos Agra Coelho. Using
Loglinear Clustering for Subcategorization Identification. In Jan M. Zytkow and
Mohamed Quafafou, editors, Proceedings of the Second European Conference on
Principles of Data Mining and Knowledge Discovery, Lecture Notes in Artificial
Intelligence 1510, pp. 379-387, Nantes, France. Springer Verlag. 1998.


On this subject there is a PhD thesis, defended quite recently, that you can
consult (in Portuguese) through the web; contact Nuno Marques
(nmm at di.fct.unl.pt). The title is "Uma Metodologia Para a Modelação
Estatística da Subcategorização Verbal" (A Methodology for Statistical
Modelling of Verbal Subcategorization).

In conclusion, a lot of work can be done with rather small corpora. It depends
on what we want to extract from them and on the methods used to learn from them
automatically.
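On the first question quoted below (what fraction of the corpus must be
analysed), a statistician's usual starting point is the classical sample-size
formula for estimating a proportion. The sketch below is only illustrative and
assumes items are sampled independently, which running text does not strictly
satisfy, so taking several samples spread across the text, as Daniel suggests,
is indeed preferable:

    import math

    def sample_size(margin, p=0.5, z=1.96):
        # Items needed so the estimated proportion lies within +/- margin
        # of the true value with about 95% confidence (p=0.5 is the worst case).
        return math.ceil(z * z * p * (1 - p) / (margin * margin))

    # e.g. to estimate the relative frequency of a case or mood to within
    # 2 percentage points:
    print(sample_size(0.02))   # about 2,401 sampled items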

Best regards,

Gabriel Pereira Lopes

Daniel Riaño wrote:

>         This is a very interesting thread. I'd like to ask the List another
> question related to it (three questions, in fact).
>
>         Let's suppose we have a large corpus of Greek text (or any text
> from a non-expansible corpus), and we want to do a grammatical analysis of
> a part of it for a study of a grammatical category (such as case, mood,
> number, etc.) from the syntactic point of view. For the analysis we'll use
> a computer editor that helps the human linguist tag the text in every
> imaginable way. The analyst does a complete morphological and semantic
> description of every word of the text and a skeleton parsing of every
> sentence, tags every syntagm with its function, and adds further
> information about anaphoric relations, etc. This corpus is homogeneous: I
> mean it is written by a single author in a given period of his life,
> without radical departures from the main narrative in either style or
> subject. Now the (first) question: what is the minimum percentage of such
> a corpus we must analyse in order to confidently extrapolate the results
> of our analysis to the whole corpus? I bet statisticians have an
> (approximate) answer for that. Bibliography? I also understand that it
> would probably be methodologically preferable to analyse several portions
> of the same size from the text, instead of parsing only one longer chunk
> of continuous text. And the third question: for such a project, what would
> be the minimum size of the analysed corpus? Any help welcome.
>
> ~~~~~~~~~~~~~~~~~~~
> Daniel Riaño Rufilanchas
> Madrid, España
>
> Please take note of my new e-mail address: danielrr at retemail.es


