[Corpora-List] POS tagging without training data?

Wed May 21 16:38:25 UTC 2003

Dear list members,

We want to develop a POS tagger for Afrikaans. We have only very small
corpora (roughly 1.5 million words in total), none of which is
annotated; our only annotated resource is a tagged lexicon, without any
context. We're considering adapting an existing tagger for, say, English
or Dutch, in order to create training data. We want to know:

(1) What "shell" (e.g. Brill, TnT, TiMBL, TOSCA) would be the most
effective/efficient to use to create training data? And how much
initial training data (i.e. manually tagged data) is needed to do this?
(2) How much training data is needed to develop a reasonably accurate
(let's say 95%) version of, for example, a Brill tagger for Afrikaans?

Thanks in advance for your help. We'll post a summary.

Yours,
Gerhard van Huyssteen & Sulene Pilon

__________________________________________________________
__________________________***_____________________________
Dr Gerhard B van Huyssteen
School for Languages || Potchefstroom University for CHE ||
POTCHEFSTROOM || 2531 || South Africa
Skool vir Tale || Potchefstroomse Universiteit vir CHO || POTCHEFSTROOM
|| 2531 || Suid-Afrika

Tel: +27 18 299 1488
Fax: +27 18 299 1562
afngbvh at puknet.puk.ac.za
__________________________________________________________
__________________________***_____________________________
