Corpora: Relative text length

James L. Fidelholtz jfidel at siu.buap.mx
Thu Apr 25 15:27:25 UTC 2002


Andrew and Spela:
	Just a word of caution: studies like Spela's provide interesting
and suggestive data, but figures will surely vary, depending on the
translator, topic, etc. [all the usual sociolinguistic caveats apply
here] (and note Jean's contribution, with varying rates).  I was
coauthor of a study comparing English and Spanish, which basically tried
to get Spanish to fit into the standard readability curves in a fairly
simple way.  We were only partially successful (I did the counts by
hand on a pseudo-random sample covering a variety of text types,
especially translations from one language to the other, as well as
translations from third languages [French & German] into each).  To
the best of my recollection (I could
look up the exact figures if anyone is hot for them), our results for
Spanish-English were rather close to Jean's for French (I assume his
were on large amounts of text done by computer--if this holds up [not
surprising, given the close relationship of French and Spanish], it may
indicate that, for this kind of data, not such a huge amount of text is
really necessary).
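
For anyone who wants the kind of index Andrew asked for (one language
set to 100), here is a minimal sketch in Python using the Multext-East
word counts quoted below.  It assumes the dots in the quoted figures
are thousands separators (so "104.302" is read as 104302 words); the
names WORD_COUNTS and relative_lengths are made up for illustration.

```python
# Word counts for Orwell's "1984" as quoted from the Multext-East page.
# Assumption: the dots in the quoted figures are thousands separators.
WORD_COUNTS = {
    "English": 104302,
    "Romanian": 101460,
    "Slovene": 91619,
    "Bulgarian": 87235,
    "Czech": 80366,
    "Hungarian": 81147,
    "Estonian": 79334,
}


def relative_lengths(counts, base="English"):
    """Return each language's word count as an index with `base` set to 100."""
    base_count = counts[base]
    return {lang: round(100 * n / base_count, 1) for lang, n in counts.items()}


if __name__ == "__main__":
    # Print one line per language: name, then index relative to English.
    for lang, index in relative_lengths(WORD_COUNTS).items():
        print(f"{lang:<10} {index:6.1f}")
```

Note that this indexes word counts, not characters or pages, and rests
on a single text with one translation per language, so the caveats
above apply in full.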

On Wed, 24 Apr 2002, spela vintar wrote:

>
>Hi Andrew,
>
>for Eastern-European languages you can compare the lengths of Orwell's 1984
>and its translations that were collected within the Multext-East project.
>The original Multext project (http://www.lpl.univ-aix.fr/projects/multext/)
>should provide the same for English, German, French, Spanish etc., however I
>wasn't able to find it on their homepage at first glance...
>
>Best,
>Spela
>
>http://nl.ijs.si/ME/CD/docs/mte-d21f/node8.html
>//////////////
>...
>Below we give an estimate for the number of words, by language. The
>wordcounts were produced by removing the SGML tags from the texts and then
>using a 'wc'-like procedure.
>
>  English      104.302
>  Romanian     101.460
>  Slovene       91.619
>  Bulgarian     87.235
>  Czech         80.366
>  Hungarian     81.147
>  Estonian      79.334
>
>
>Andrew Bredenkamp wrote:
>
>> Hello everyone,
>>
>> Does anyone know where I can find a list of relative text length?
>>
>> Taking one language as an index (100), I would like a list of the (other)
>> main European languages - e.g. (made up):
>>
>> Spanish: 100
>> English: 105
>> French: 110
>> German: 85
>>
>> ... etc.
>>
>> Thanks a lot in advance for any help you can give me.
>>
>> Cheers,
>> Andrew
>> =========================================
>> Andrew Bredenkamp
>> acrolinx GmbH
>> URL:            www.acrolinx.com
>>
>> =========================================
>
>
>
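
The "'wc'-like procedure" mentioned in the quoted Multext-East note
(strip the SGML tags, then count whitespace-separated tokens) could be
sketched roughly as follows.  This is only a guess at the general
approach, not the project's actual script; TAG_RE and count_words are
hypothetical names, and entities such as "&amp;" are simply left inside
whatever token contains them.

```python
import re

# Anything between "<" and the next ">" is treated as a tag.
TAG_RE = re.compile(r"<[^>]*>")


def count_words(sgml_text):
    """Count whitespace-separated tokens after removing SGML tags."""
    # Replace each tag with a space so words on either side stay separate.
    text = TAG_RE.sub(" ", sgml_text)
    return len(text.split())
```

Replacing a tag with a space will split a word that has markup in the
middle of it, so counts from a sketch like this can differ slightly
from those of a real tokenizer.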

-- 
James L. Fidelholtz			e-mail: jfidel at siu.buap.mx
Posgrado en Ciencias del Lenguaje	tel.: +(52-2)229-5500 x5705
Instituto de Ciencias Sociales y Humanidades	fax: +(01-2) 229-5681
Benemérita Universidad Autónoma de Puebla, MÉXICO


