Corpora: Relative text length

Martin Wynne martin.wynne at ota.ahds.ac.uk
Fri Apr 26 11:51:41 UTC 2002


The MULTEXT-EAST corpora are available from the TRACTOR archive
(www.tractor.de).
For Orwell's 1984 in the original and in translation, I looked at the values
of the 'extent' element in the headers and found the following:

Language          Words    Size (as given in header)
English          104302    928986 bytes
Bulgarian         87235    2733655 bytes
Czech             80366    1230804 bytes
Estonian          79334    1066273 bytes
Hungarian         81167    1270210 bytes
Romanian         118093    1272607 bytes
Slovene           91619    945857 bytes
Latvian           81956    1051 kb
Lithuanian        71252    904 kb
Serbo-Croatian    89749    863 kb
Russian           76469    2.2 mb

Please note that the headers also include caveats and explanations regarding
how the counts were done. The word counts appear to be counts of the number
of tokens in the text, while the byte counts generally include the header
and tags too. Please refer to the actual headers for further information and
acknowledgements of the researchers involved.
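
For anyone who wants relative lengths rather than raw counts, here is a
minimal sketch in Python that divides each word count above by the English
one. The names WORDS and relative_lengths are my own, purely for
illustration, and the choice of English as the baseline is an assumption
(it is the original text). The byte counts are left out because, as noted
above, they generally include the header and tags and are not all given in
the same unit.

```python
# Word (token) counts copied from the 'extent' figures quoted above.
WORDS = {
    "English": 104302,
    "Bulgarian": 87235,
    "Czech": 80366,
    "Estonian": 79334,
    "Hungarian": 81167,
    "Romanian": 118093,
    "Slovene": 91619,
    "Latvian": 81956,
    "Lithuanian": 71252,
    "Serbo-Croatian": 89749,
    "Russian": 76469,
}


def relative_lengths(words, baseline="English"):
    """Return each language's word count as a ratio of the baseline count."""
    base = words[baseline]
    return {lang: count / base for lang, count in words.items()}


if __name__ == "__main__":
    ratios = relative_lengths(WORDS)
    # Print shortest to longest relative to the baseline.
    for lang, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
        print(f"{lang:<15}{ratio:.3f}")
```

A ratio below 1 means the translation has fewer tokens than the English
original. These ratios inherit all the caveats in the headers about how the
tokens were counted, so they are only a rough guide.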

__
Martin Wynne
martin.wynne at ota.ahds.ac.uk
Linguistics Officer
Oxford Text Archive

Oxford University Computing Services
13 Banbury Road
Oxford
UK - OX2 6NN
Tel: +44 1865 283299
Fax: +44 1865 273275


