[Corpora-List] State-of-the-art POS tagging results: A summary

Hrafn Loftsson hrafn at ru.is
Thu Nov 27 14:38:57 UTC 2008


Hello all.

I was asked to post a summary regarding the following question I posed
about 2 weeks ago:

"Can anyone point me to papers presenting state-of-the-art POS tagging
results for some morphologically complex languages?

In his paper "Morphological Tagging: Data vs. Dictionaries" (2000), Jan
Hajic presents an evaluation for Czech, Estonian, Hungarian, Romanian,
and Slovene, but I wonder if you know of more recent work."


Thanks to all who responded. Here is an extract from the responses:

------------------------------------------------------------------------
Italian is certainly a morphologically rich language, but I do not know
if it is complex enough (in the sense you are interested in)...

In any case, last year we set up an evaluation campaign for NLP tools
devoted to Italian, and one of the tasks was POS tagging.

You can find all the evaluation results on the EVALITA 2007 web site:
http://evalita.fbk.eu/2007/
------------------------------------------------------------------------

Hebrew and Arabic may count as "morphologically complex languages".

For Hebrew, have a look at:

Roy Bar-Haim, Khalil Sima'an and Yoad Winter. Part-of-Speech Tagging
of Modern Hebrew Text. In Journal of Natural Language Engineering
(J-NLE)
<http://www.cambridge.org/journals/journal_catalogue.asp?mnemonic=nle>,
14(2):223-251, 2008.

This work was extended to Arabic:

Saib Mansour, Khalil Sima'an and Yoad Winter. Smoothing a Lexicon-based
POS tagger for Arabic and Hebrew. In Proceedings of the ACL 2007 Workshop
on Computational Approaches to Semitic Languages: Common Issues and
Resources. Prague, Czech Republic, 2007.
------------------------------------------------------------------------

Here's a paper from three years ago that shows results for Arabic,
Korean, and Czech; it does segmentation and tagging within one model.

Context-Based Morphological Disambiguation with Random Fields
Noah A. Smith, David A. Smith, and Roy W. Tromble
In Proceedings of the Human Language Technology Conference and
Conference on Empirical Methods in Natural Language Processing, pages
475-482, Vancouver, BC, October 2005.
------------------------------------------------------------------------

You might want to have a look at this year's COLING paper by Florian Laws
and myself, in which we presented a POS tagger for fine-grained tagsets
and evaluated it on German as well as Czech data.
http://www.ims.uni-stuttgart.de/www/projekte/gramotron/PAPERS/COLING08/Schmid-Laws.pdf
------------------------------------------------------------------------

There are some more recent results for Estonian.

There is a paper on statistical tagging of Estonian, by Heiki-Jaan
Kaalep, Tarmo Vaino. "Complete Morphological Analysis in the Linguist’s
Toolbox." Congressus Nonus Internationalis Fenno-Ugristarum Pars V, pp.
9-16, Tartu 2001. http://www.cl.ut.ee/yllitised/smugri_toolbox_2001.pdf

There are several papers on rule-based tagging of Estonian:

Kaili Müürisep, Tiina Puolakainen, Kadri Muischnek, Mare Koit, Tiit
Roosmaa, Heli Uibo. A New Language for Constraint Grammar: Estonian.
International Conference Recent Advances in Natural Language Processing.
Proceedings. Borovets, Bulgaria, 2003, pp. 304-310.
http://math.ut.ee/~kaili/papers/ranlp03.pdf

Kadri Muischnek, Kaili Müürisep, Tiina Puolakainen. Adpositions in
Estonian Computational Syntax. Proceedings of the Second ACL-SIGSEM
Workshop on The Linguistic Dimensions of Prepositions and their Use in
Computational Linguistics Formalisms and Applications. University of
Essex, 19-21 April 2005. Colchester, UK. pp. 2-9.
http://www.cs.ut.ee/~kaili/papers/muischneketal.pdf

Kaili Müürisep, Heli Uibo. Shallow Parsing of Spoken Estonian Using
Constraint Grammar. Treebanking for Discourse and Speech. Proceedings of
NODALIDA 2005 Special Session on Treebanks for Spoken Language and
Discourse (ed. Peter Juel Henrichsen and Peter Rossen Skadhauge);
Copenhagen Studies in Language 32. Samfundslitteratur. 2006. pp. 105-118.
http://www.cs.ut.ee/~kaili/papers/myyruiboLatex.pdf
------------------------------------------------------------------------

We recently did a similar study for Russian:
http://corpus.leeds.ac.uk/mocky/

There are also more references in the LREC paper available from the same
page.
------------------------------------------------------------------------

Please also check the results from the CADIM group at Columbia on
morphological disambiguation (POS tagging) for Arabic:

Roth, Ryan, Owen Rambow, Nizar Habash, Mona Diab, and Cynthia Rudin.
Arabic Morphological Tagging, Diacritization, and Lemmatization Using
Lexeme Models and Feature Ranking. In Proceedings of Association for
Computational Linguistics (ACL), Columbus, Ohio. 2008.

Diab, Mona. Towards an optimal POS tag set for Modern Standard Arabic
Processing. Recent Advances in Natural Language Processing (RANLP),
Borovets, Bulgaria, 2007.

Diab, Mona, Kadri Hacioglu and Daniel Jurafsky. Automated Methods for
Processing Arabic Text: From Tokenization to Base Phrase Chunking. Book
Chapter. In Arabic Computational Morphology: Knowledge-based and
Empirical Methods. Editors Antal van den Bosch and Abdelhadi Soudi.
Kluwer/Springer Publications, 2007.

Habash, Nizar and Rambow, Owen, 2007. Arabic Diacritization through Full
Morphological Tagging. In Human Language Technologies 2007: The
Conference of the North American Chapter of the Association for
Computational Linguistics (NAACL HLT 2007); Companion Volume, Short
Papers.

Habash, Nizar and Owen Rambow. Arabic Tokenization, Morphological
Analysis, and Part-of-Speech Tagging in One Fell Swoop. In Proceedings
of the Conference of American Association for Computational Linguistics
(ACL05).

Diab, Mona, Kadri Hacioglu and Daniel Jurafsky. Automatic Tagging of
Arabic Text: From Raw Text to Base Phrase Chunks. Proceedings of Human
Language Technology-North American Association for Computational
Linguistics (HLT-NAACL), 2004.
------------------------------------------------------------------------

You can find some recent results on Spanish, Romanian and Polish in:

Grzegorz Chrupała, Georgiana Dinu and Josef van Genabith. 2008.
Learning Morphology with Morfette. In Proceedings of LREC 2008.
http://www.lrec-conf.org/proceedings/lrec2008/pdf/594_paper.pdf

There are also some further experiments on those languages as well as
Welsh, Irish, Czech and Slovene in Chapter 6 of:

Grzegorz Chrupała. 2008. Towards a Machine-Learning Architecture for
Lexical Functional Grammar Parsing. PhD dissertation, Dublin City
University.
http://www.lsv.uni-saarland.de/personalPages/gchrupala/papers/phd.pdf
------------------------------------------------------------------------

Some updates to the paper by Hajič that you cite can be found at
http://ufal.mff.cuni.cz/czech-tagging/. You probably want to look at
the work from 2005 onwards.
------------------------------------------------------------------------


--
Regards,
Hrafn Loftsson, Ph.D. - www.ru.is/faculty/hrafn
Assistant Professor
School of Computer Science - www.ru.is/cs
Reykjavik University - www.ru.is

