Corpora: Frequency list for Russian
Serge Sharoff
s_sharoff at yahoo.com
Thu Apr 18 07:49:36 UTC 2002
A list of the most frequent Russian words is available at:
http://www.artint.ru/projects/frqlist/frqlist-en.asp
Currently, Chastotnyj slovarj russkogo jazyka (Zasorina, 1977)
provides the most widely used frequency list for Russian.
However, the corpus used by Zasorina is relatively small by
modern standards (about 1 million words). It is also outdated:
it mostly covers usage from the 1920s to the 1960s and includes
a high proportion of ideological sources, such as texts by Lenin
and Khrushchev and Soviet newspapers, so its word frequencies
are severely biased. Finally, the Zasorina (1977) list is not
available electronically.
The announced list is compiled from a corpus of modern Russian
fiction and political texts (more than 35 million words).
The list includes about 33000 words whose frequency is greater
than 1 ipm (instances per million words). A shorter selection of
the 5000 most frequent words is also available.
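
To make the ipm threshold concrete, here is a minimal sketch of the
conversion between raw counts and ipm. The function names are
illustrative only, and the corpus size is assumed to be exactly
35 million words (the corpus is described only as "more than 35
million"), so the numbers are approximate.

```python
# Assumed corpus size; the real corpus is "more than 35 million words".
CORPUS_SIZE = 35_000_000


def ipm(count: int, corpus_size: int = CORPUS_SIZE) -> float:
    """Convert a raw occurrence count to instances per million words."""
    return count * 1_000_000 / corpus_size


def min_count_for_ipm(threshold_ipm: float, corpus_size: int = CORPUS_SIZE) -> float:
    """Raw count that corresponds to a given ipm threshold."""
    return threshold_ipm * corpus_size / 1_000_000


# Under the assumed size, 1 ipm corresponds to 35 occurrences,
# so a word needs more than 35 occurrences to enter the list.
print(min_count_for_ipm(1))
print(ipm(70))
```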
The structure of the lists follows the template of the lemmatised
BNC lists produced by Adam Kilgarriff
(http://www.itri.bton.ac.uk/~Adam.Kilgarriff/bnc-readme.html),
namely:
word rank, frequency (in ipm), word, part of speech.
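
A line in this four-field layout could be read along the following
lines. This is only a sketch: it assumes the fields are separated by
whitespace and that lemmas contain no spaces, neither of which is
stated above. The sample line is an invented placeholder, not an
entry from the actual list, and the names Entry and parse_line are
hypothetical.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    rank: int    # position in the frequency-ordered list
    ipm: float   # frequency in instances per million words
    word: str    # lemma
    pos: str     # part-of-speech tag


def parse_line(line: str) -> Entry:
    """Parse one 'rank frequency word pos' line (whitespace-separated, assumed)."""
    rank, freq, word, pos = line.split()
    return Entry(int(rank), float(freq), word, pos)


# Invented placeholder values, for illustration only.
entry = parse_line("1 1000.5 и conj")
print(entry.word, entry.ipm)
```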
In addition, some analytical information about the lexical stock
is provided, such as the coverage of total language use by word
bands, e.g. the first 3000 lemmas cover 76.6824% of the total
number of word forms.
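
Coverage by word bands follows directly from the ipm values: since
ipm is a rate per million words, the summed ipm of the top N lemmas
divided by one million gives the share of running text they cover.
The sketch below assumes that every word form is counted under
exactly one lemma; the function name and the toy numbers are
illustrative, not taken from the list.

```python
from typing import Iterable


def coverage_percent(ipm_values: Iterable[float], top_n: int) -> float:
    """Percentage of running words covered by the top_n most frequent lemmas."""
    ranked = sorted(ipm_values, reverse=True)
    return sum(ranked[:top_n]) * 100 / 1_000_000


# Toy frequencies (in ipm), invented for illustration only.
toy_ipm = [500_000, 200_000, 100_000, 50_000]
print(coverage_percent(toy_ipm, 2))
```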
The corpus, tools for working with it, as well as an aligned
parallel English-Russian corpus are discussed in the forthcoming
publication:
Sharoff, Serge (2002). Meaning as use: exploitation of aligned
corpora for the contrastive study of lexical semantics. Proc. of
the Language Resources and Evaluation Conference (LREC02). May 2002,
Las Palmas, Spain. http://www.artint.ru/projects/frqlist/lrec-02.pdf