Corpora: Relative text length
James L. Fidelholtz
jfidel at siu.buap.mx
Thu Apr 25 15:27:25 UTC 2002
Andrew and Spela:
Just a word of caution: studies like Spela's provide interesting
and suggestive data, but the figures will surely vary depending on the
translator, topic, etc. (all the usual sociolinguistic caveats apply
here; note also Jean's contribution, with its varying rates). I was a
coauthor of a study comparing English and Spanish, which basically tried
to get Spanish to fit into the standard readability curves in a fairly
simple way. We were only partially successful. The counts were done by
hand by yours truly on a variety of text types, pseudo-randomly sampled,
and especially on translations from one language to the other, as well
as translations from third languages (French and German) into each. To
the best of my recollection (I could look up the exact figures if anyone
is hot for them), our results for Spanish-English were rather close to
Jean's for French. I assume his were computed by machine on large
amounts of text. If this holds up (not surprising, given the close
relationship of French and Spanish), it may indicate that, for this kind
of data, not such a huge amount of text is really necessary.
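
For anyone who wants the kind of index Andrew asked for (one language
set to 100), here is a minimal sketch in Python using the Multext-East
word counts quoted below. It assumes the dots in those figures are
thousands separators, and it measures length in words only, so a
character-based index could look quite different. The function name
relative_length is made up for illustration.

```python
# Word counts for Orwell's "1984" as quoted in this thread from the
# Multext-East documentation. Assumption: the dots in the quoted
# figures are thousands separators (104.302 means 104,302 words).
WORD_COUNTS = {
    "English": 104302,
    "Romanian": 101460,
    "Slovene": 91619,
    "Bulgarian": 87235,
    "Czech": 80366,
    "Hungarian": 81147,
    "Estonian": 79334,
}


def relative_length(counts, base="English"):
    """Return each language's word count as an index with base = 100."""
    base_count = counts[base]
    return {lang: round(100 * n / base_count, 1) for lang, n in counts.items()}


if __name__ == "__main__":
    # Print one line per language: name, then index relative to English.
    for lang, index in relative_length(WORD_COUNTS).items():
        print(f"{lang:<10} {index:5.1f}")
```

Passing base="Slovene" (or any other key) re-indexes the same counts
against a different language. With a single text and a single
translator per language, the caveats above apply in full.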
On Wed, 24 Apr 2002, spela vintar wrote:
>
>Hi Andrew,
>
>for Eastern-European languages you can compare the lengths of Orwell's 1984
>and its translations that were collected within the Multext-East project.
>The original Multext project (http://www.lpl.univ-aix.fr/projects/multext/)
>should provide the same for English, German, French, Spanish etc., however I
>wasn't able to find it on their homepage at first glance...
>
>Best,
>Spela
>
>http://nl.ijs.si/ME/CD/docs/mte-d21f/node8.html
>//////////////
>...
>Below we give an estimate for the number of words, by language. The
>wordcounts were produced by removing the SGML tags from the texts and then
>using a 'wc'-like procedure.
>
> English     104.302
> Romanian    101.460
> Slovene      91.619
> Bulgarian    87.235
> Czech        80.366
> Hungarian    81.147
> Estonian     79.334
>
>
>Andrew Bredenkamp wrote:
>
>> Hello everyone,
>>
>> Does anyone know where I can find a list of relative text length?
>>
>> Taking one language as an index (100), I would like a list of the (other)
>> main European languages - e.g. (made up):
>>
>> Spanish: 100
>> English: 105
>> French: 110
>> German: 85
>>
>> ... etc.
>>
>> Thanks a lot in advance for any help you can give me.
>>
>> Cheers,
>> Andrew
>> =========================================
>> Andrew Bredenkamp
>> acrolinx GmbH
>> URL: www.acrolinx.com
>>
>> =========================================
>
>
>
--
James L. Fidelholtz e-mail: jfidel at siu.buap.mx
Posgrado en Ciencias del Lenguaje tel.: +(52-2)229-5500 x5705
Instituto de Ciencias Sociales y Humanidades fax: +(01-2) 229-5681
Benemérita Universidad Autónoma de Puebla, MÉXICO