[Lexicog] POS Annotation Tools for Bantu Languages

Mike Maxwell maxwell at LDC.UPENN.EDU
Sat Oct 25 19:21:25 UTC 2008


Today's message from Piotr about a language ID tool for Bantu languages
reminded me to look at the earlier postings on this topic, including
this one:

Emmanuel HABUMUREMYI wrote:
> ...I was in need of a open source, or free software that can be
> customised to certain language on one's choice and provide results
> similar to the ones from: http://ucrel.lancs.ac.uk/claws/trial.html
> (appropiate for English) or http://aflat.org/?q=node/10 (for
> Kiswahili).

As described at the latter website, the Kiswahili tagger uses a free
memory-based tagger, available from http://ilk.uvt.nl/mbt/, which can
indeed be customized to a particular language.  Such taggers are
typically built by hand-tagging a starter corpus, then "training" the
tagger on that hand-tagged corpus.
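
To make the train-then-tag cycle concrete, here is a minimal Python
sketch.  It is NOT the memory-based algorithm that MBT uses; it is only
a most-frequent-tag baseline, and the two-sentence Kiswahili "corpus",
the tag names (N, V) and the function names train() and tag() are all
made up for illustration.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Learn the most frequent tag for each word in a hand-tagged corpus."""
    counts = defaultdict(Counter)   # word -> Counter of tags seen with it
    all_tags = Counter()            # overall tag frequencies
    for sentence in tagged_sentences:
        for word, tag_label in sentence:
            counts[word.lower()][tag_label] += 1
            all_tags[tag_label] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    # Unknown words fall back to the most frequent tag in the corpus.
    default_tag = all_tags.most_common(1)[0][0]
    return lexicon, default_tag


def tag(words, lexicon, default_tag):
    """Tag each word with its learned tag, or the default if unseen."""
    return [(w, lexicon.get(w.lower(), default_tag)) for w in words]


# Toy hand-tagged starter corpus (hypothetical; a real one is far larger).
corpus = [
    [("mtoto", "N"), ("anasoma", "V"), ("kitabu", "N")],
    [("mwalimu", "N"), ("anasoma", "V")],
]

lexicon, default_tag = train(corpus)
result = tag(["mwalimu", "anasoma", "kitabu", "kizuri"], lexicon, default_tag)
print(result)
```

The unseen word "kizuri" (an adjective) simply receives the default
tag here, which is wrong.  A baseline like this has nothing to go on
for word forms it has never seen, and a highly inflected language
produces a great many of those, so a real tagger needs some way of
using context or word-internal structure.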

There are various algorithms used by such taggers, and many freely 
available taggers; one place to look is 
http://nlp.stanford.edu/links/statnlp.html#Taggers.  I'm not familiar
with the advantages and disadvantages of the different approaches.  Nor
do I know how well they do on highly inflected languages like the Bantu
languages vs. more isolating languages like English.  (Ideally, of
course, they would handle both.)  Lastly, I don't know how much of a
training corpus they need in order to do a good job (and of course 
"good" is an elastic term), but I would guess you might be looking at 
100,000 words of hand-tagged text for each language.

I believe it should be possible to take the information used by a tagger 
for one language, and have a program "learn" to tag a related language 
with less hand-tagged data.  Unfortunately, I don't know anyone who is
working on that.  (I would be very interested to learn that I am wrong!)

If you need guidance on choosing and training taggers, there's a mailing 
list that often deals with such things, Corpora.  It is archived at 
http://listserv.linguistlist.org/archives/corpora.html, and you can 
subscribe from there.
-- 
	Mike Maxwell
	maxwell at ldc.upenn.edu
