Corpora: size of training corpus and tagset size
Sofie J K
Sofie.Johansson.Kokkinakis at svenska.gu.se
Wed Jan 9 11:14:50 UTC 2002
Dear readers of the corpora list,
1) I have a question concerning the relation between the required
minimum size of a training corpus and the performance of the tagger
being trained. In an article I found a reference to:
J. M. Baker, 1982, The Performing Arts - How to measure up! Proceedings
of the NBS Workshop on Standardisation for Speech I/O Technology, pp
25-33.
In this paper Baker describes the minimal number of test tokens using
the following formula:
n=4 x 104+log 1/x
where "x" is the error rate of the tagger.
My question is whether this is the only such formula, or whether there
are other formulas for computing the required size of a training corpus.
2) I also have a second question, concerning studies of the relation
between a tagger's performance and the size of its tagset.
I have found two articles so far (Zavrel and Daelemans, 1999, Recent Advances
in Memory-Based Part-of-Speech Tagging, Tilburg University) and
(Elworthy, 1995, Tagset Design and Inflected Languages, Sharp
Laboratories of Europe Ltd, Oxford.)
Does anyone know of any other studies?
Best regards,
Sofie Johansson Kokkinakis
--
************************************************************************
* Sofie Johansson Kokkinakis sofie.johansson.kokkinakis at svenska.gu.se*
* Systemanalyst/ Ph.D Student http://svenska.gu.se/~svesj/ *
* Språkdata, Inst. för svenska språket Tel: +46 (0)31 773 5281 *
* (Dept. of Swedish Language) Fax: +46 (0)31 773 4455 *
* Göteborgs universitet, Box 200 SE 405 30 GÖTEBORG, Sweden *
* Computers are not intelligent. They just think they are. *
************************************************************************