Corpora: Relative text length

spela vintar vintar at dfki.de
Wed Apr 24 13:49:25 UTC 2002


Hi Andrew,

For Eastern European languages, you can compare the lengths of Orwell's 1984
and its translations, which were collected within the Multext-East project.
The original Multext project (http://www.lpl.univ-aix.fr/projects/multext/)
should provide the same for English, German, French, Spanish etc.; however, I
wasn't able to find it on their homepage at first glance...

Best,
Spela

http://nl.ijs.si/ME/CD/docs/mte-d21f/node8.html
//////////////
...
Below we give an estimate for the number of words, by language. The
wordcounts were produced by removing the SGML tags from the texts and then
using a 'wc'-like procedure.

  English      104.302
  Romanian     101.460
  Slovene       91.619
  Bulgarian     87.235
  Czech         80.366
  Hungarian     81.147
  Estonian      79.334
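
As a rough sketch, the counts above can be turned into the kind of index
Andrew asks for, with English set to 100. The function names, the regular
expression and the choice of English as the base are my own assumptions, not
part of the Multext-East documentation. count_words only approximates the
"remove SGML tags, then wc" procedure described above: it ignores entity
references and treats a tag inside a word as a word break. Note also that
these are word counts, so the index reflects the number of words, not the
number of characters.

```python
import re

# Word counts as quoted above; the "." in the listing is a thousands separator.
WORD_COUNTS = {
    "English": 104302,
    "Romanian": 101460,
    "Slovene": 91619,
    "Bulgarian": 87235,
    "Czech": 80366,
    "Hungarian": 81147,
    "Estonian": 79334,
}

# Assumption: anything between "<" and ">" is an SGML tag.
TAG_RE = re.compile(r"<[^>]*>")


def count_words(sgml_text):
    """Strip SGML tags, then count whitespace-separated tokens (like `wc -w`)."""
    return len(TAG_RE.sub(" ", sgml_text).split())


def relative_lengths(counts, base="English"):
    """Return each count as an index relative to `base`, where base = 100."""
    base_count = counts[base]
    return {lang: round(100 * n / base_count, 1) for lang, n in counts.items()}


if __name__ == "__main__":
    for lang, index in relative_lengths(WORD_COUNTS).items():
        print(f"{lang:<10} {index:6.1f}")
```

Passing a different base, e.g. relative_lengths(WORD_COUNTS, base="Slovene"),
rescales the same figures against another language.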


Andrew Bredenkamp wrote:

> Hello everyone,
>
> Does anyone know where I can find a list of relative text length?
>
> Taking one language as an index (100), I would like a list of the (other)
> main European languages - e.g. (made up):
>
> Spanish: 100
> English: 105
> French: 110
> German: 85
>
> ... etc.
>
> Thanks a lot in advance for any help you can give me.
>
> Cheers,
> Andrew
> =========================================
> Andrew Bredenkamp
> acrolinx GmbH
> URL:            www.acrolinx.com
>
> =========================================


