Corpora: Relative text length
Martin Wynne
martin.wynne at ota.ahds.ac.uk
Fri Apr 26 11:51:41 UTC 2002
The MULTEXT-EAST corpora are available from the TRACTOR archive
(www.tractor.de).
For Orwell's 1984 in original and translations, I looked at the values for
the 'extent' element in the headers and got the following information:
Language         Words   Size
English         104302   928986 bytes
Bulgarian        87235   2733655 bytes
Czech            80366   1230804 bytes
Estonian         79334   1066273 bytes
Hungarian        81167   1270210 bytes
Romanian        118093   1272607 bytes
Slovene          91619   945857 bytes
Latvian          81956   1051 KB
Lithuanian       71252   904 KB
Serbo-Croatian   89749   863 KB
Russian          76469   2.2 MB
Please note that the headers also include caveats and explanations regarding
how the counts were done. Basically, the word counts appear to be counts of
the tokens in the text, while the byte counts generally include the header
and tags too. Please refer to the actual headers for further information and
acknowledgements of the researchers involved.
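For anyone who wants the relative lengths directly, here is a minimal sketch
that divides each word count by the English one. The figures are copied from
the list above; the names WORD_COUNTS and relative_lengths are only
illustrative and do not come from the corpus distribution. It uses the word
counts alone, since the byte counts generally include the header and tags and
are given in mixed units.

```python
# Word counts copied from the 'extent' values quoted in this message.
WORD_COUNTS = {
    "English": 104302,
    "Bulgarian": 87235,
    "Czech": 80366,
    "Estonian": 79334,
    "Hungarian": 81167,
    "Romanian": 118093,
    "Slovene": 91619,
    "Latvian": 81956,
    "Lithuanian": 71252,
    "Serbo-Croatian": 89749,
    "Russian": 76469,
}


def relative_lengths(counts, baseline="English"):
    """Return each language's token count divided by the baseline's count."""
    base = counts[baseline]
    return {lang: n / base for lang, n in counts.items()}


if __name__ == "__main__":
    ratios = relative_lengths(WORD_COUNTS)
    # Print the languages from shortest to longest relative to English.
    for lang, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
        print(f"{lang:<16}{ratio:.2f}")
```

On these figures only the Romanian version comes out longer than the English
original in tokens. The caveats in the headers about how the tokens were
counted apply to these ratios as well.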
__
Martin Wynne
martin.wynne at ota.ahds.ac.uk
Linguistics Officer
Oxford Text Archive
Oxford University Computing Services
13 Banbury Road
Oxford
UK - OX2 6NN
Tel: +44 1865 283299
Fax: +44 1865 273275