[Ura-list] Volga-Kama Uralic corpora available

Тимофей Архангельский timarkh at gmail.com
Fri Jun 28 11:49:44 EDT 2019


Dear colleagues,

I'd like to present several corpora of Uralic languages of the Volga-Kama
area I've been working on for the last two years. The main start page is
located here: http://volgakama.web-corpora.net/index_en.html. Here are the
pages for the individual languages:

Udmurt: http://udmurt.web-corpora.net/index_en.html
Komi-Zyrian: http://komi-zyrian.web-corpora.net/index_en.html
Erzya: http://erzya.web-corpora.net/index_en.html
Moksha: http://moksha.web-corpora.net/index_en.html
Meadow Mari: http://meadow-mari.web-corpora.net/index_en.html

For each language, there is a social media corpus and a "Main" corpus that
includes everything else (mostly news outlets). All corpora have been
morphologically analyzed with rule-based analyzers; in most cases, there
was no subsequent disambiguation. The search interface is available in
English and in Russian; the lemmata have Russian translations. The corpora
vary in size from 14 thousand to 9.5 million words. In addition, the
social media corpora contain Russian text in much larger quantities.

Regarding the Meadow Mari corpora: later this year, we are going to join
forces with Jeremy Bradley and his colleagues, who have been working on
much larger and better annotated literary Mari corpora for some time (cf.
corpus.mari-language.com).

Please do not hesitate to send me your questions and comments; I will be
happy to answer them.

Best regards,
Timofey Arkhangelskiy