[Corpora-List] Part-of-speech tagger

Tue Nov 12 12:33:04 UTC 2002

On Mon, Nov 11, 2002 at 08:52:20PM -0500, Afsaneh Fazly wrote:
>
> Greetings,
>
>   I need to build a part-of-speech tagger for a new language
> (for which there is no PoS-tagger available). For this, I need
> to hand-annotate a minimum amount of text. I would like to know
> how much text (minimum of course) I need to hand-tag. Also,
> for this much text, what is the reasonable size of the tagset
> used for annotation?
>
> Regards,
>
> Afsaneh

The minimal amount of annotated text, strictly speaking, is probably none.
There is a Computer Speech and Language paper by Julian Kupiec explaining
how and why it is possible to train an HMM-based POS tagger without annotated
text. You do need to make decisions about the tagset, and to create a lexicon
relating words to their possible tags.
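
To make the idea concrete, here is a minimal sketch of that kind of training:
Baum-Welch (EM) re-estimation of an HMM from untagged sentences, with the
lexicon restricting which tags each word may emit. It is a generic textbook
version, not a reconstruction of the method in Kupiec's paper. The function
names, the toy lexicon, the three-tag tagset and the smoothing constant are
all assumptions made up for illustration. It also assumes that every word
appears in the lexicon and that no sentence is empty.

```python
def normalise(dist):
    """Scale a dict of non-negative weights to sum to 1 (no-op if all zero)."""
    total = sum(dist.values())
    if total > 0:
        for key in dist:
            dist[key] /= total

def posteriors(words, tags, init, trans, emit):
    """Scaled forward-backward; returns per-word and per-bigram tag posteriors."""
    n = len(words)
    alpha, scale = [], []
    for i, w in enumerate(words):
        row = {}
        for t in tags:
            prior = init[t] if i == 0 else sum(alpha[i - 1][s] * trans[s][t] for s in tags)
            row[t] = prior * emit[t].get(w, 0.0)  # zero if the lexicon forbids t for w
        c = sum(row.values())
        alpha.append({t: v / c for t, v in row.items()})
        scale.append(c)
    beta = [dict.fromkeys(tags, 1.0) for _ in range(n)]
    for i in range(n - 2, -1, -1):
        nxt = words[i + 1]
        for s in tags:
            beta[i][s] = sum(trans[s][t] * emit[t].get(nxt, 0.0) * beta[i + 1][t]
                             for t in tags) / scale[i + 1]
    gamma = [{t: alpha[i][t] * beta[i][t] for t in tags} for i in range(n)]
    xi = [{(s, t): alpha[i][s] * trans[s][t] * emit[t].get(words[i + 1], 0.0)
                   * beta[i + 1][t] / scale[i + 1]
           for s in tags for t in tags} for i in range(n - 1)]
    return gamma, xi

def train(sentences, lexicon, tags, iterations=10, eps=1e-6):
    """Baum-Welch on untagged text; eps is a small smoothing count (an assumption)."""
    init = {t: 1.0 / len(tags) for t in tags}
    trans = {s: {t: 1.0 / len(tags) for t in tags} for s in tags}
    # Start uniform over the words each tag is allowed to emit.
    emit = {t: {w: 1.0 for w, allowed in lexicon.items() if t in allowed} for t in tags}
    for t in tags:
        normalise(emit[t])
    for _ in range(iterations):
        new_init = {t: eps for t in tags}
        new_trans = {s: {t: eps for t in tags} for s in tags}
        new_emit = {t: {w: eps for w in emit[t]} for t in tags}
        for words in sentences:
            gamma, xi = posteriors(words, tags, init, trans, emit)
            for t in tags:
                new_init[t] += gamma[0][t]
            for i, w in enumerate(words):
                for t in tags:
                    if w in new_emit[t]:
                        new_emit[t][w] += gamma[i][t]
            for pair_posterior in xi:
                for (s, t), p in pair_posterior.items():
                    new_trans[s][t] += p
        init, trans, emit = new_init, new_trans, new_emit
        normalise(init)
        for t in tags:
            normalise(trans[t])
            normalise(emit[t])
    return init, trans, emit

def tag(words, tags, model):
    """Posterior decoding: pick the most probable tag at each position."""
    gamma, _ = posteriors(words, tags, *model)
    return [max(tags, key=lambda t: gamma[i][t]) for i in range(len(words))]

# Toy data, invented for illustration only.
TAGS = ["DET", "NOUN", "VERB"]
LEXICON = {"the": {"DET"}, "a": {"DET"}, "dog": {"NOUN"}, "dogs": {"NOUN"},
           "walk": {"NOUN", "VERB"}, "walks": {"NOUN", "VERB"}, "bark": {"NOUN", "VERB"}}
SENTENCES = [["the", "dog", "walks"], ["dogs", "bark"], ["a", "walk"],
             ["the", "dogs", "walk"], ["the", "bark"]]
model = train(SENTENCES, LEXICON, TAGS)
predicted = tag(["the", "dog", "walks"], TAGS, model)
```

Note that tag() uses posterior decoding rather than Viterbi, only to keep the
sketch short. Whatever it predicts is always a tag the lexicon permits for
that word, which is why the quality of the lexicon matters so much here.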

But in practice, most people still use annotated text. A Computational
Linguistics paper by Bernard Merialdo includes a careful measurement of
when using annotated text is helpful (and there is similar work from
about the same time by David Elworthy). How much you need depends on
the complexity of the tagset and the text that you use. Once again there
is good work by Elworthy (from a 1995 EACL workshop) that explains the
tradeoffs. In practice, many people use tagsets which are close to the
Brown and/or CLAWS tagsets developed in the early years. But languages
differ a lot, so it is probably worth thinking carefully about what you
are doing and why. For languages with richer morphology than English,
part-of-speech tagging might turn out to be trivial if (a big if) you have
a good morphological analyser, impossible otherwise. And so on...

Chris