[Corpora-List] morphological analysis: Russian

Sun Mar 12 22:34:11 UTC 2006

About a month ago, I requested information on this topic and promised to
post a summary of the replies I received.  Here it is (thanks to all you
informative resources!):

Grigori Sidorov writes:
If you want just separation of words into stem and flexion, then you can
use our system of morphological analysis.
This system does not have information about suffixes or prefixes.

www.cic.ipn.mx/~sidorov/rmorph

The example on the page does not show the separation into stem and flexion,
but the system has this function.
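
As a rough illustration of what "separation into stem and flexion" means
(this is not Sidorov's system; the list of endings and the function name
are hypothetical and chosen only for the example), a longest-match split
could be sketched like this:

```python
# Toy list of Russian inflectional endings (flexions). This is an assumption
# made for illustration only; a real analyser uses a full dictionary.
ENDINGS = ["ами", "ями", "ого", "ому", "ые", "ий", "ая", "ов", "ам", "ах",
           "а", "у", "ы", "е", "и", "о", "й"]


def split_stem_flexion(word, endings=ENDINGS, min_stem=2):
    """Return (stem, flexion), preferring the longest matching ending.

    If no ending matches, or the remaining stem would be shorter than
    min_stem characters, the whole word is returned as the stem.
    """
    word = word.lower()
    for ending in sorted(endings, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= min_stem:
            return word[:-len(ending)], ending
    return word, ""


if __name__ == "__main__":
    for w in ["столами", "книга", "новые", "стол"]:
        print(w, split_stem_flexion(w))
```

A split like this knows nothing about suffixes, prefixes, or irregular
forms, so it will mis-segment many words; it only shows the shape of the
output.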

---------

Roman Yangarber writes:
the only thing i am aware of that is in existence, is a tool (being?)
developed by a team at the now impoverished academy of sciences in Moscow.
(it's headed by Igor Boguslavsky.)  we had talked about some collaboration
a few years back (in the context of information extraction), but i've not
had an opportunity to evaluate the tool myself.  i just know something
exists.

---------

Eric Atwell writes:
If you can't find a good morphological analyser for Russian,
then try an unsupervised learning system: the EU PASCAL research network
has just run the MorphoChallenge2005 contest to develop unsupervised
learning systems to learn morphological analysis from corpus data.  The
contestants were evaluated for English, Finnish, and Turkish, but
hopefully systems general enough to learn morphological segmentation for
these 3 different languages should also cope with Russian.  Winner(s)
are still to be announced - see http://www.cis.hut.fi/morphochallenge2005/

--------

Jonathan Young writes:
Here's what I found in my inventory and from a quick google search:

- ispell has wordlists for Russian; Stanford pointed me to the broken
link ftp://mch5.chem.msu.su/pub/russian/ispell/ ("rus-ispell"), but
there appear to be several.  While not a true morphological analyzer,
the wordlists have significant structure because of the /XYZ suffix
codes, which encode which endings follow each "root".  It's totally
uninterpreted (words are just character strings, not lemmas/morphs, no
POS tags, etc.), but it might be a good starting point.

- http://www.artint.ru/projects/frqlist/frqlist-en.asp contains Russian
word and lemma frequency lists (similar to Adam Kilgarriff's frequency
lists for the BNC; they appear to have a corpus of a similar size, but I
can't find it), a paper, and a link to another morphological analyzer:
Dialing, at http://www.aot.ru/ .  My Russian isn't good enough to tell
exactly what the AOT folks are doing, but there's plenty of technical
documentation, as well as a free download of both Linux and Windows
versions of their (probably commercially sold) lemmatizer and a Python
scripting interface.

- FreeBSD appears to have lemmatizers for English, German, and Russian -
one source is
http://osmirrors.cerias.purdue.edu/pub/FreeBSD/distfiles/lemmatizer/ .
I haven't tested this, but it looks promising.  It may also be the same
technology as the aot.ru lemmatizer.

- XRCE has a demo at
www.xrce.xerox.com/competencies/content-analysis/demos/russian
(commercial).

- http://clr.nmsu.edu/Research/Projects/tide/Russian.html mentions an
algorithm by Svetlana Sheremetyeva and Sergei Nirenburg, but all I can
find on the web is links to papers.

- There's a demo at http://starling.rinet.ru/morph.htm .  Dictionaries
and executable code are downloadable from
http://starling.rinet.ru/downl.php?lan=en#dict , but the dictionaries
are not really human-readable.

- http://snowball.tartarus.org/algorithms/russian/stemmer.html documents
in great detail a Russian stemmer written in Snowball.  This is (IMHO)
older, more primitive technology, and (similar to the well-known Porter
stemmer for English) it is based on a small number of hand-coded rules,
and is unlikely to include many well-known special cases (e.g. most
irregular verbs).

- RussianStemmer.java and other utilities in Lucene (my notes say
LuBo?).  Lucene is now part of Apache, and can be found at
http://lucene.apache.org/ .  The code cites the Russian stemmer at
http://snowball.sourceforge.net .

- PyStemmer at http://sourceforge.net/projects/pystemmer (v 0.10 is the
only version released).  According to the project page,  "PyStemmer
provides stemmer functionality in Python for English, German, Norwegian,
Italian, Dutch, Portuguese, French, Swedish. PyStemmer is based on the
Snowball stemmer (snowball.sourceforge.net)" - but it also has rules for
Russian.  Note that the same snowball stemmer source is cited.

- Unitex v 1.2 has support for Russian, but it appears to be mostly
empty stubs.
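
To illustrate the ispell-style /XYZ suffix codes mentioned in the first
item of this list, here is a minimal sketch of how such a wordlist entry
could be expanded into surface forms.  The flag letters and the
(strip, add) rules below are hypothetical, not the ones defined in
rus-ispell, and real ispell affix files also carry conditions on the
root that this sketch leaves out.

```python
# Hypothetical affix table: each flag maps to a list of (strip, add) pairs.
# "strip" is removed from the end of the root before "add" is appended.
AFFIX_RULES = {
    "A": [("", "а"), ("", "у"), ("", "ом"), ("", "е")],
    "P": [("", "ы"), ("", "ов"), ("", "ам")],
    "F": [("а", "ы"), ("а", "е")],
}


def parse_entry(line):
    """Split a wordlist line such as 'стол/AP' into (root, [flags])."""
    root, _, flags = line.strip().partition("/")
    return root, list(flags)


def expand(line, rules=AFFIX_RULES):
    """Return the root followed by every form licensed by its flags."""
    root, flags = parse_entry(line)
    forms = [root]
    for flag in flags:
        for strip, add in rules.get(flag, []):
            if strip and not root.endswith(strip):
                continue  # rule does not apply to this root
            base = root[:len(root) - len(strip)]
            forms.append(base + add)
    return forms


if __name__ == "__main__":
    print(expand("стол/AP"))
    print(expand("карта/F"))
```

The output is still just a set of strings: nothing in the wordlist says
which form is which case or number, which is why such lists are only a
starting point for morphological analysis.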

--------

Lars Borin writes:

Please have a look at this site: <http://www.aot.ru/>

--------

As far as I know, there is a relatively small number of corpora
concerning Russian word segmentation.  I can point to the following:

1.      http://www.philol.msu.ru/~lex/corpus/  (200,000 running words,
about 5,500 word-formation models). Unfortunately, the corpus has not
been available since last autumn.
2.      http://www.ruscorpora.ru (more than 65 million running words; some
information about word structure can be extracted using semantic
annotation tags, such as diminutive, having the IK-suffix, as in sadIK
"small garden")

3. I know of only one (rather small) dictionary on the Net that represents
the word structure of about 3,000 Russian words. It is the Russian
Derivational and Morphemes Dictionary, prepared in Kazan'
(http://www.kcn.ru/tat_ru/universitet/infres/slovar/index.htm)

Two other resources are not available through the Internet, but one can
try to contact the person in charge of each of the following projects.
4. A Computer Implementation of Russian Derivational Morphology
represented in DATR (http://www.surrey.ac.uk/LIS/SMG/lever_final_desc.htm)
5. An Electronic Dictionary of Russian Morphemes (based on the
comprehensive Dictionary of Russian Morphemes, by A.I. Kuznetsov & T.F.
Efremova). The author of the database is a school teacher,
Tatiana Sentsova from Moscow, and I have no idea how to reach her.
....
Finally, I can send you two reviews on the topics. Both are written in
Russian.
1.      S. Koval', Resursy po russkoi morfologii v internete
2.      T. Reznikova, M. Kopotev. "Lingvisticheski annotirovannye korpusa
russkogo yazyka (obzor obschedostupnyh resursov)"  Natsional'nyi korpus
russkogo yazyka 2003-2005. Moscow, 2006 [in press]

---------

Jasper Holmes writes:
I'm sure that the Surrey Morphology Group
(http://www.surrey.ac.uk/LIS/SMG/) will have some relevant
information.