[Corpora-List] POS tagging without training data?

Gerhard van Huyssteen AFNGBVH at puknet.puk.ac.za
Wed May 21 16:38:25 UTC 2003

Dear list members,

We want to develop a POS tagger for Afrikaans. We have only very small
corpora (roughly 1.5 million words in total), none of which is
annotated. Our only tagged resource is a lexicon, which has no context.
We're considering adapting an existing tagger for, say, English or
Dutch, in order to create training data. We want to know:

(1) Which "shell" (e.g. Brill, TnT, TiMBL, TOSCA) would be the
most effective and efficient for creating training data? And how much
initial training data (i.e. manually tagged data) is needed to do this?
(2) How much training data is needed to develop a reasonably accurate
(let's say 95%) version of, for example, a Brill tagger for Afrikaans?

Thanks in advance for your help. We'll post a summary.

Gerhard van Huyssteen & Sulene Pilon

Dr Gerhard B van Huyssteen
School for Languages || Potchefstroom University for CHE ||
POTCHEFSTROOM || 2531 || South Africa
Skool vir Tale || Potchefstroomse Universiteit vir CHO || POTCHEFSTROOM
|| 2531 || Suid-Afrika

Tel: +27 18 299 1488
Fax: +27 18 299 1562
afngbvh at puknet.puk.ac.za
