[Lexicog] POS Annotation Tools for Bantu Languages
Mike Maxwell
maxwell at LDC.UPENN.EDU
Sat Oct 25 19:21:25 UTC 2008
Today's message by Piotr about a language ID tool for Bantu languages
reminded me to look at the earlier postings on this topic, including:
Emmanuel HABUMUREMYI wrote:
> ...I was in need of an open source, or free software that can be
> customised to a certain language of one's choice and provide results
> similar to the ones from: http://ucrel.lancs.ac.uk/claws/trial.html
> (appropriate for English) or http://aflat.org/?q=node/10 (for
> Kiswahili).
As described at the latter website, the Kiswahili tagger uses a free
memory-based tagger, available from http://ilk.uvt.nl/mbt/, which can
indeed be customized to a particular language. Such taggers are
typically built by hand-tagging a starter corpus, then "training" the
tagger on that hand-tagged corpus.
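
To make the "hand-tag, then train" workflow concrete, here is a minimal
sketch in Python. It is not the memory-based tagger mentioned above and
does not use its interface; it is only a most-frequent-tag baseline that
ignores context. The function names, the two-tag set (N, V) and the tiny
Kiswahili corpus are all invented for illustration.

```python
from collections import Counter, defaultdict

# Toy hand-tagged corpus (invented for illustration): each sentence is a
# list of (word, tag) pairs using a made-up two-tag set, N and V.
TOY_CORPUS = [
    [("mtoto", "N"), ("anasoma", "V"), ("kitabu", "N")],
    [("mwalimu", "N"), ("anaandika", "V"), ("barua", "N")],
]


def train(tagged_sentences):
    """Learn the most frequent tag for each word form.

    Returns (lexicon, default_tag), where default_tag is the most
    frequent tag overall and is used for words never seen in training.
    """
    tag_counts_by_word = defaultdict(Counter)
    overall_tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts_by_word[word.lower()][tag] += 1
            overall_tag_counts[tag] += 1
    lexicon = {
        word: counts.most_common(1)[0][0]
        for word, counts in tag_counts_by_word.items()
    }
    default_tag = overall_tag_counts.most_common(1)[0][0]
    return lexicon, default_tag


def tag(words, lexicon, default_tag):
    """Tag each word with its learned tag, or default_tag if unseen."""
    return [(w, lexicon.get(w.lower(), default_tag)) for w in words]


if __name__ == "__main__":
    lexicon, default_tag = train(TOY_CORPUS)
    # "gazeti" does not occur in the toy corpus, so it gets the default tag.
    print(tag(["mwalimu", "anasoma", "gazeti"], lexicon, default_tag))
```

A real tagger would also look at the neighbouring words and tags, and
for a highly inflected language probably at prefixes and suffixes too,
but the overall shape (count over hand-tagged text, then apply to new
text) is the same.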
There are various algorithms used by such taggers, and many freely
available taggers; one place to look is
http://nlp.stanford.edu/links/statnlp.html#Taggers. I'm not familiar
with the advantages and disadvantages of the different approaches. Nor
do I know how well they do on highly inflected languages like the Bantu
languages vs. more isolating languages like English. (Ideally, of
course, they would handle both kinds equally well.) Lastly, I don't know
how much of a training corpus they need in order to do a good job (and
of course "good" is an elastic term), but I would guess you might be
looking at 100,000 words of hand-tagged text for each language.
I believe it should be possible to take the information used by a tagger
for one language, and have a program "learn" to tag a related language
with less hand-tagged data. Unfortunately, I don't know anyone who is
working on that. (I would be very interested to learn that I am wrong!)
If you need guidance on choosing and training taggers, there's a mailing
list that often deals with such things, Corpora. It is archived at
http://listserv.linguistlist.org/archives/corpora.html, and you can
subscribe from there.
--
Mike Maxwell
maxwell at ldc.upenn.edu