[Corpora-List] Looking for super large Russian corpus
resnik at umiacs.umd.edu
Thu Oct 28 14:22:17 UTC 2004
Well, since you asked for a huge corpus... In case it might be
useful, we have created a very large file (122M compressed, 1.8G
uncompressed) containing over 25 million URLs, collected from the
Internet Archive (www.archive.org), for pages that were identified as
Russian by automatic language ID. Some percentage of the URLs will be
stale, of course, and language ID is not perfect, but a large
percentage of the pages should still be out there and the language
identification is pretty accurate. You can download any subset of the
URLs you want, convert to plain text, apply your own stricter language
ID if you'd like, and, voilà, a huge collection of Russian text.
The URL list is available from the STRAND download page,
http://umiacs.umd.edu/~resnik/strand/ under "Monolingual Russian".
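
The download, convert, and filter steps described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the STRAND tooling: it assumes the URL list has one URL per line (the actual file layout is not described here), and the helper names (sample_urls, html_to_text, cyrillic_ratio, looks_russian, fetch_text) are hypothetical. The Cyrillic-ratio check is a crude stand-in for a stricter language ID step; it cannot tell Russian apart from other languages written in Cyrillic.

```python
import random
import re
import urllib.request
from html.parser import HTMLParser


def sample_urls(lines, k, seed=0):
    """Reservoir-sample up to k non-empty lines (assumed: one URL per line)."""
    rng = random.Random(seed)
    reservoir = []
    urls = (u for u in (line.strip() for line in lines) if u)
    for i, url in enumerate(urls):
        if i < k:
            reservoir.append(url)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = url
    return reservoir


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Convert an HTML string to whitespace-normalized plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


def cyrillic_ratio(text):
    """Fraction of alphabetic characters in the basic Cyrillic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for ch in letters if "\u0400" <= ch <= "\u04FF")
    return cyrillic / len(letters)


def looks_russian(text, threshold=0.7):
    """Crude filter; the 0.7 threshold is an arbitrary illustrative choice."""
    return cyrillic_ratio(text) >= threshold


def fetch_text(url, timeout=10):
    """Download one page and return its plain text.

    Falls back to UTF-8 when the server declares no (or an unknown) charset;
    older Russian pages often use windows-1251 or KOI8-R instead.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        raw = resp.read()
    try:
        html = raw.decode(charset, errors="replace")
    except LookupError:
        html = raw.decode("utf-8", errors="replace")
    return html_to_text(html)
```

With the uncompressed list opened as a text file, one might call sample_urls(f, 1000), pass each URL to fetch_text inside a try/except that skips stale links, and keep only the pages for which looks_russian returns True. Reservoir sampling is used so that the 1.8G list never has to be held in memory.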
Philip
----------------------------------------------------------------
Philip Resnik, Associate Professor
Department of Linguistics and Institute for Advanced Computer Studies
1401 Marie Mount Hall            UMIACS phone: (301) 405-6760
University of Maryland           Linguistics phone: (301) 405-8903
College Park, MD 20742 USA       Fax: (301) 314-2644 / (301) 405-7104
http://umiacs.umd.edu/~resnik    E-mail: resnik at umiacs.umd.edu