[Corpora-List] POS tagging without training data?

Chris Brew cbrew at ling.ohio-state.edu
Wed May 21 17:48:45 UTC 2003


If you have the patience, and are willing to acquire (or hire)
the necessary expertise, the tagger described in

http://citeseer.nj.nec.com/cutting92practical.html

is likely to do the job for you, without the need for
significant amounts of tagged training data. This
tagger needs a large amount of unlabelled text,
a lexicon, and a little information about morphology
and possible tag sequences. This has been done for
Spanish

http://xxx.arxiv.cornell.edu/abs/cmp-lg/9505035

The original Xerox code is available at:

ftp://ftp.parc.xerox.com/pub/tagger/

If you read about how to build the tagger for Spanish, then do
likewise for Afrikaans, you'll have a decent POS tagger which
may already meet your needs.
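To make the idea concrete, here is a toy sketch (in Python, not the
original Xerox Common Lisp) of what such a tagger does with only a
lexicon and unlabelled text: Viterbi decoding restricted to each word's
lexicon tags, with hard-EM re-estimation standing in for full
Baum-Welch. All words, tags, and sentences below are invented for
illustration; real systems also need the morphology component for
unknown words, which is omitted here.

```python
# Toy sketch of a lexicon-constrained HMM tagger trained on unlabelled
# text (in the spirit of Cutting et al. 1992).  Hard EM (Viterbi
# re-estimation) replaces full Baum-Welch to keep the example short.
from collections import defaultdict

LEXICON = {                     # word -> set of admissible tags
    "the": {"DET"}, "a": {"DET"},
    "dog": {"NOUN"}, "cat": {"NOUN"},
    "man": {"NOUN", "VERB"},    # ambiguous entry
    "bites": {"VERB"}, "sees": {"VERB"},
    "walks": {"NOUN", "VERB"},  # ambiguous entry
}
TAGS = sorted({t for ts in LEXICON.values() for t in ts})

def viterbi(sent, trans, emit):
    """Best tag path under the current model, restricted to lexicon tags."""
    V = [{t: (emit[t].get(sent[0], 1e-9), [t])
          for t in sorted(LEXICON[sent[0]])}]
    for w in sent[1:]:
        layer = {}
        for t in sorted(LEXICON[w]):
            prev = max(V[-1], key=lambda u: V[-1][u][0] * trans[u][t])
            p, path = V[-1][prev]
            layer[t] = (p * trans[prev][t] * emit[t].get(w, 1e-9), path + [t])
        V.append(layer)
    return max(V[-1].values(), key=lambda v: v[0])[1]

def train(sents, iters=5):
    # Uniform start: any transition, any lexicon-compatible emission.
    trans = {t: {u: 1.0 / len(TAGS) for u in TAGS} for t in TAGS}
    emit = {t: {w: 1.0 for w, ts in LEXICON.items() if t in ts} for t in TAGS}
    for _ in range(iters):
        tc = defaultdict(lambda: defaultdict(int))   # transition counts
        ec = defaultdict(lambda: defaultdict(int))   # emission counts
        for s in sents:
            tags = viterbi(s, trans, emit)
            for w, t in zip(s, tags):
                ec[t][w] += 1
            for a, b in zip(tags, tags[1:]):
                tc[a][b] += 1
        # Re-estimate with add-one smoothing, keeping lexicon constraints.
        trans = {t: {u: (tc[t][u] + 1) / (sum(tc[t].values()) + len(TAGS))
                     for u in TAGS} for t in TAGS}
        emit = {t: {w: (ec[t][w] + 1) / (sum(ec[t].values()) + len(LEXICON))
                    for w in LEXICON if t in LEXICON[w]} for t in TAGS}
    return trans, emit

corpus = [["the", "dog", "bites"], ["a", "cat", "sees", "the", "man"],
          ["the", "man", "walks"]]
trans, emit = train(corpus)
tags = viterbi(["the", "dog", "walks"], trans, emit)
```

Every output tag is guaranteed to come from the word's lexicon entry;
how the genuinely ambiguous words resolve depends on what the EM
iterations find in the unlabelled corpus.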


In fairness, I should also note

Bernard Merialdo, Tagging English Text with a Probabilistic Model
Computational Linguistics, 1994

which points out that if you _do_ have large amounts of reliably tagged
training data, you may be able to improve your results. But largely
unsupervised tagging is certainly an option worth exploring.

Once you have run your text through your version of the Xerox
tagger, the text will be tagged, probably with decent accuracy.
If, for some reason, this isn't good enough, you could indeed
treat the output (possibly after hand editing)
as training material for some other tagger. It's hard to predict
how much this would help (but Thorsten Brants did some experiments
suggesting that really large amounts of imperfect training data
can be helpful).
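The bootstrap step above can be sketched as follows. The
most-frequent-tag model here is only a stand-in for retraining a fuller
tagger such as Brill or TnT, and the auto-tagged sentences are invented
placeholders for the Xerox tagger's output.

```python
# Sketch of the bootstrap: treat the (possibly hand-corrected) output
# of the unsupervised tagger as training material for a second tagger.
from collections import Counter, defaultdict

def train_mft(tagged_sents):
    """Most-frequent-tag model estimated from (word, tag) pairs."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    # Each word gets the tag it received most often in the auto-tagged text.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

# Invented stand-in for the unsupervised tagger's output.
auto_tagged = [
    [("the", "DET"), ("dog", "NOUN"), ("bites", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train_mft(auto_tagged)
```

Brants' point applies here: the auto-tagged corpus is noisy, but with
enough of it the estimated model can still be useful.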

Good luck with your enterprise

Chris



>
> We want to develop a POS tagger for Afrikaans. We only have very small
> corpora (around 1.5 million words in total), none of which is
> annotated (with the exception of a tagged lexicon, without any context).
> We're considering adapting an existing tagger for, say, English or
> Dutch, in order to create training data. We want to know:
>
> (1) What "shell" (e.g. Brill, TnT, TiMBL, TOSCA, etc.) would be the
> most effective/efficient to use to create training data? And how much
> initial training data (i.e. manually tagged data) is needed to do this
> ?
> (2) How much training data is needed to develop a reasonably accurate
> (let's say 95%) version of, for example, a Brill tagger for Afrikaans?
>
> Thanks in advance for your help. We'll post a summary.
>
> Yours,
> Gerhard van Huyssteen & Sulene Pilon
>
>
>
> __________________________________________________________
> __________________________***_____________________________
> Dr Gerhard B van Huyssteen
> School for Languages || Potchefstroom University for CHE ||
> POTCHEFSTROOM || 2531 || South Africa
>
> Tel: +27 18 299 1488
> Fax: +27 18 299 1562
> afngbvh at puknet.puk.ac.za
> __________________________________________________________
> __________________________***_____________________________
>
> This message (and attachments) is subject to restrictions and a
> disclaimer. Please refer to
> http://www.puk.ac.za/itb/e-pos/disclaimer.html for full details, or at
> itbsekr at puknet.puk.ac.za
> __________________________________________________________
> __________________________***_____________________________

--
==================================================================
Dr. Chris Brew,  Assistant Professor of Computational Linguistics
Department of Linguistics, 1712 Neil Avenue, Columbus OH 43210
Tel:  +614 292 5420 Fax: +614 292 8833
Web:http://www.ling.ohio-state.edu/~cbrew Email:cbrew at ling.osu.edu
==================================================================


