<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
Dear readers of the corpora list,
<p>1) I have a question concerning the relation between the required minimum
size of a training corpus and the performance of the tagger being trained.
In an article I found a reference to:
<p>J. M. Baker, 1982, <i>The Performing Arts - How to measure up!</i> Proceedings
of the NBS Workshop on Standardisation for Speech I/O Technology, pp 25-33.
<p>In this paper Baker gives the minimum number of test tokens by
the following formula:
<p>n = 4 &times; 10<sup>4 + log(1/x)</sup>
<p>where "x" is the error rate of the tagger.
<br>My question is whether this is the only such formula, or whether there
are other formulas for computing the size of training corpora.
<br>
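<p>As a minimal sketch of how the formula above could be evaluated (this is not taken from Baker's paper): it assumes that "log" means the base-10 logarithm and that x is expressed as a fraction between 0 and 1. Both are assumptions, and the function name <code>min_test_tokens</code> is hypothetical.

```python
import math


def min_test_tokens(error_rate: float) -> float:
    """Evaluate n = 4 * 10**(4 + log(1/x)) for a given error rate x.

    Assumptions (not confirmed by the source):
      - "log" is the base-10 logarithm;
      - x is a fraction in (0, 1], e.g. 0.05 for a 5% error rate.
    """
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in the interval (0, 1]")
    return 4 * 10 ** (4 + math.log10(1 / error_rate))


if __name__ == "__main__":
    # Print n for a few illustrative error rates (chosen arbitrarily).
    for x in (0.1, 0.05, 0.01):
        print(f"x = {x}: n = {min_test_tokens(x):.0f}")
```

<p>Under the base-10 assumption the expression reduces to 4 &times; 10<sup>4</sup> / x, so a lower error rate calls for proportionally more test tokens. With a different logarithm base or with x given in percent, the numbers would differ.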
<p>2) I also have a second question, concerning studies of the relation
between a tagger's performance and the size of its tagset. I have found
two articles so far (Zavrel and Daelemans, 1999, <i>Recent Advances in
Memory-Based Part-of-Speech Tagging</i>, Tilburg University) and (Elworthy,
1995, <i>Tagset Design and Inflected Languages</i>, Sharp Laboratories
of Europe Ltd, Oxford).
<br>Does anyone know of any other studies?
<p>Best regards,
<p>Sofie Johansson Kokkinakis
<pre>--
************************************************************************
* Sofie Johansson Kokkinakis sofie.johansson.kokkinakis@svenska.gu.se*
* Systemanalyst/ Ph.D Student <A HREF="http://svenska.gu.se/~svesj/">http://svenska.gu.se/~svesj/</A> *
* Språkdata, Inst. för svenska språket Tel: +46 (0)31 773 5281 *
* (Dept. of Swedish Language) Fax: +46 (0)31 773 4455 *
* Göteborgs universitet, Box 200 SE 405 30 GÖTEBORG, Sweden *
* Computers are not intelligent. They just think they are. *
************************************************************************</pre>
</html>