Corpora: size of training corpus and tagset size

Sofie J K Sofie.Johansson.Kokkinakis at svenska.gu.se
Wed Jan 9 11:14:50 UTC 2002


Dear readers of the corpora list,

1) I have a question concerning the relation between the required
minimum size of a training corpus and the performance of the tagger
being trained. In an article I found a reference to:

J. M. Baker, 1982, The Performing Arts - How to measure up! Proceedings
of the NBS Workshop on Standardisation for Speech I/O Technology, pp
25-33.

In this paper Baker gives the minimum number of test tokens by the
following formula:

n=4 x 104+log 1/x

where "x" is the error rate of the tagger.
My question is whether this is the only such formula, or whether there
are other formulas for computing the required size of a training corpus.


2) I also have a second question, concerning studies of the relation
between a tagger's performance and the size of its tagset. I have found
two articles so far (Zavrel and Daelemans, 1999, Recent Advances in
Memory-Based Part-of-Speech Tagging, Tilburg University) and (Elworthy,
1995, Tagset Design and Inflected Languages, Sharp Laboratories of
Europe Ltd, Oxford).
Does anyone know of any other studies?

Best regards,

Sofie Johansson Kokkinakis

--
************************************************************************
* Sofie Johansson Kokkinakis   sofie.johansson.kokkinakis at svenska.gu.se*
* Systemanalyst/ Ph.D Student           http://svenska.gu.se/~svesj/   *
* Språkdata, Inst. för svenska språket  Tel: +46 (0)31 773 5281        *
* (Dept. of Swedish Language)           Fax: +46 (0)31 773 4455        *
* Göteborgs universitet, Box 200        SE 405 30 GÖTEBORG, Sweden     *
*       Computers are not intelligent. They just think they are.       *
************************************************************************

