[Corpora-List] Looking for super large Russian corpus

resnik at umiacs.umd.edu resnik at umiacs.umd.edu
Thu Oct 28 14:22:17 UTC 2004


Well, since you asked for a huge corpus...  In case it might be
useful, we have created a very large file (122M compressed, 1.8G
uncompressed) containing over 25 million URLs, collected from the
Internet Archive (www.archive.org), for pages that were identified as
Russian by automatic language ID.  Some percentage of the URLs will be
stale, of course, and language ID is not perfect, but a large
percentage of the pages should still be out there and the language
identification is pretty accurate.  You can download any subset of the
URLs you want, convert to plain text, apply your own stricter language
ID if you'd like, and, voila, a huge collection of Russian text.
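
The download / convert / filter steps above could be sketched roughly as
follows in Python. This is only an illustration under stated assumptions,
not the STRAND tooling: it assumes the uncompressed list has one URL per
line, it falls back to windows-1251 when a page declares no charset, and
it uses the share of Cyrillic letters as a crude stand-in for a real
language identifier. All function names and the 0.6 threshold are
hypothetical.

```python
import random
import urllib.request
from html.parser import HTMLParser


def sample_urls(lines, k, seed=0):
    """Reservoir-sample up to k URLs from an iterable of lines.

    Assumes one URL per line; blank lines are skipped. Works in a single
    pass, so the 1.8G list never has to be held in memory.
    """
    rng = random.Random(seed)
    sample, seen = [], 0
    for line in lines:
        url = line.strip()
        if not url:
            continue
        seen += 1
        if len(sample) < k:
            sample.append(url)
        else:
            j = rng.randrange(seen)
            if j < k:
                sample[j] = url
    return sample


def fetch(url, timeout=10):
    """Download a page and decode it to str.

    The windows-1251 fallback is an assumption; older Russian pages may
    just as well be KOI8-R.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "windows-1251"
        return resp.read().decode(charset, errors="replace")


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Strip markup and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def cyrillic_ratio(text):
    """Fraction of alphabetic characters in the basic Cyrillic block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for c in letters if "\u0400" <= c <= "\u04ff")
    return cyrillic / len(letters)


def looks_russian(text, threshold=0.6):
    """Crude filter: mostly-Cyrillic text. Also passes Ukrainian, Bulgarian, etc."""
    return cyrillic_ratio(text) >= threshold
```

To use it, one might open the uncompressed list, call sample_urls() on the
file object, and keep html_to_text(fetch(url)) for each URL where
looks_russian() is true, catching errors for the stale URLs. A proper
language identifier would be needed to separate Russian from other
languages written in Cyrillic.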

The URL list is available from the STRAND download page,
http://umiacs.umd.edu/~resnik/strand/ under "Monolingual Russian".

  Philip

  ----------------------------------------------------------------
  Philip Resnik, Associate Professor
  Department of Linguistics and Institute for Advanced Computer Studies

  1401 Marie Mount Hall            UMIACS phone: (301) 405-6760
  University of Maryland           Linguistics phone: (301) 405-8903
  College Park, MD 20742 USA	   Fax: (301) 314-2644 / (301) 405-7104
  http://umiacs.umd.edu/~resnik	   E-mail: resnik at umiacs.umd.edu


