[Corpora-List] Re: The size of Internet in words

Serge Sharoff s.sharoff at leeds.ac.uk
Tue Jan 20 18:40:44 UTC 2004


It is unusual to have a summary of responses for a query sent a few hours ago.
Thanks to Thierry Fontenelle. The answer is provided in the special issue of
Computational Linguistics, Vol 29, No 3.  The introduction written by Adam
Kilgarriff and Greg Grefenstette lists data for 30 different Latin-script
languages (obtained through AltaVista in March 2001). The answer for English
is 76,598,718,000 words; German comes second with 7,035,850,000 words,
and French third (3,836,874,000).

The issue is not yet available via Ingenta, but the introduction is freely
downloadable from the MIT Press website:
http://www-mitpress.mit.edu/journals/pdf/coli_29_3_333_0.pdf

I was interested in Russian data, but they are available from another
source: Yandex  (the major Russian search engine, http://www.yandex.ru )
indexed 1.5 TB of unique texts (in Russian only), giving in total about 250
billion words (more than the English figure from Kilgarriff and Grefenstette, but
these are data from Feb 2004).  If more recent data are available for
English and other languages, please let me know.

Best wishes,
Serge

--
Dr. Serge Sharoff
Centre for Translation Studies
School of Modern Languages and Cultures
University of Leeds
Leeds, LS2 9JT

tel: +44(0)113 343 7287
fax: +44(0)113 343 3287
WWW: http://www.comp.leeds.ac.uk/ssharoff/


