<!doctype html public "-//w3c//dtd html 4.0 transitional//en">

<html>

Dear readers of the corpora list,

<p>1) I have a question concerning the relation between the required minimum

size of a training corpus and the performance of the tagger being trained.

In an article I found a reference to:

<p>J. M. Baker, 1982, <i>The Performing Arts - How to measure up!</i> Proceedings
of the NBS Workshop on Standardisation for Speech I/O Technology, pp. 25-33.

<p>In this paper Baker gives the minimum number of test tokens required by
the following formula:

<p>n = 4 &times; 10<sup>4 + log(1/x)</sup>

<p>where "x" is the error rate of the tagger.

<br>My question is whether this is the only such formula, or whether there are
other formulas for computing the required size of a training corpus.
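<p>For readers who want to try the numbers, here is a small sketch that simply
evaluates the formula quoted above. It rests on two assumptions that the
reference as quoted here does not settle: that "log" means the base-10 logarithm,
and that the error rate is given as a fraction (0.05 for 5%). The function name
<code>min_test_tokens</code> is only illustrative.

```python
import math


def min_test_tokens(error_rate):
    """Evaluate n = 4 * 10**(4 + log(1/x)) for an error rate x.

    Assumptions (not confirmed by the quoted reference):
    - log is the base-10 logarithm
    - error_rate is a fraction strictly between 0 and 1
    """
    if not (1 > error_rate > 0):
        raise ValueError("error_rate must be strictly between 0 and 1")
    exponent = 4 + math.log10(1 / error_rate)
    # Round to the nearest integer to absorb floating-point noise.
    return round(4 * 10 ** exponent)


if __name__ == "__main__":
    for rate in (0.1, 0.05, 0.01):
        print(rate, min_test_tokens(rate))
```

<p>Under the base-10 assumption the expression reduces to 40000/x, so halving
the error rate doubles the number of tokens the formula asks for.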

<br> 

<p>2) My second question concerns studies of the relation between a tagger's
performance and the size of its tagset. I have found two articles so far:
Zavrel and Daelemans, 1999, <i>Recent Advances in Memory-Based Part-of-Speech
Tagging</i>, Tilburg University, and Elworthy, 1995, <i>Tagset Design and
Inflected Languages</i>, Sharp Laboratories of Europe Ltd, Oxford.

<br>Does anyone know of any other studies?

<p>Best regards,

<p>Sofie Johansson Kokkinakis

<pre>-- 

************************************************************************

* Sofie Johansson Kokkinakis   sofie.johansson.kokkinakis@svenska.gu.se*

* Systemanalyst/ Ph.D Student           <A HREF="http://svenska.gu.se/~svesj/">http://svenska.gu.se/~svesj/</A>   *

* Språkdata, Inst. för svenska språket  Tel: +46 (0)31 773 5281        *

* (Dept. of Swedish Language)           Fax: +46 (0)31 773 4455        *

* Göteborgs universitet, Box 200        SE 405 30 GÖTEBORG, Sweden     *

*       Computers are not intelligent. They just think they are.       *

************************************************************************</pre>

 </html>