[Corpora-List] Free morphological analyser for Polish

Adam Radziszewski kocikikut at gmail.com
Wed Apr 6 14:06:33 UTC 2011


Dear corpora members,
we've released an open morphological analyser for Polish. The analyser
consists of two parts:
• the morphological dictionary, resulting from tagset conversion from
Morfologik 1.7 (morfologik.blogspot.com), licensed under Creative Commons
ShareAlike or GNU LGPL (the user is free to choose),
• a configurable morphological analysis and tokenisation framework called Maca
(GNU GPL; bundled with ready-to-use configurations for Polish and the above
dictionary compiled as a transducer).

The analyser is able to output in the tagset of the IPI PAN Corpus. This is
important, since MSD taggers for Polish (at least TaKIPI and Pantera) resort
to external analysers when tagging plain text — and to the best of our
knowledge, there is no other free combination of a training corpus and an
analyser that operate on the same tagset.

Dictionary “source” and its description:
http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki/Morfologik_converted
The MACA system: http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki/

The fragment of the IPI PAN Corpus referred to above as the training corpus is available at:
http://korpus.pl/index.php?lang=en&page=download

It's also worth noting that the MACA suite contains a tokeniser (“toki”)
that is probably the first C++ open-source implementation of SRX
segmentation rules. Both toki and maca proper may be used as shared
libraries or through their simple command-line utilities (tested only under
GNU/Linux).

Best regards,
Adam