Corpora: Relative text length

Tadeusz Piotrowski tadpiotr at plusnet.pl
Thu Apr 25 18:23:17 UTC 2002


Common sense suggests that Prof. Wilks is right, and I do believe he
is. The average word length in English and Polish seems to support
this:

Average word length in characters:
Polish   5.92 (corpus for a frequency dictionary from the 1960s)
English  4.26 (LOB)
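
For reference, a figure of this kind can be computed with something like
the sketch below. It is only an illustration, not the procedure used for
the numbers above: it assumes plain whitespace tokenization, counts any
attached punctuation as part of the word, and the function name
average_word_length is made up for the example.

```python
def average_word_length(text: str) -> float:
    """Mean number of characters per whitespace-delimited token."""
    words = text.split()  # assumption: a "word" is any whitespace-delimited token
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


print(round(average_word_length("On the wagon"), 2))  # prints 3.33
```

A different tokenizer (for example one that strips punctuation) would
give somewhat different averages, so such figures are comparable only
when both corpora are counted the same way.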

Thus it seems straightforward that an English text should be shorter
than its Polish counterpart. In practice, however, this is very
difficult to demonstrate. One obvious approach is to compare a text
with its translation. Here are the results:

One utility text:

Original English 
characters 95715 
words 14573

Translated Polish 
characters 100756 
words 15243 

So far so good.
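
Expressed as an index with English = 100, the figures for the utility
text can be worked out as in the sketch below. The counts are copied
from above; the helper name relative_index and the dictionary layout are
just choices made for this example.

```python
# Counts copied from the utility text figures above.
english = {"characters": 95715, "words": 14573}
polish = {"characters": 100756, "words": 15243}


def relative_index(translation: dict, original: dict, key: str) -> float:
    """Length of the translation with the original set to 100."""
    return 100.0 * translation[key] / original[key]


for key in ("characters", "words"):
    value = relative_index(polish, english, key)
    print(f"{key}: Polish = {value:.1f} (English = 100)")
# prints:
# characters: Polish = 105.3 (English = 100)
# words: Polish = 104.6 (English = 100)
```

On this one text the Polish translation is about 5% longer by either
measure.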

But translators have their own individual styles. To level that out, I
compared one English text with three Polish translations.

B. Singer On the wagon
Words 4329
Characters 24028

translation1
Words 3396
Characters 22237

translation2
Words 3636
Characters 23866

translation3
Words 3380
Characters 22119

And that is surprising: here it is the English text that is longer, in
both words and characters.
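
The same comparison can be put on an index with the English original set
to 100. The sketch below uses only the counts listed above; the variable
names and the simple averaging over the three translations are
assumptions made for illustration.

```python
# (name, words, characters), copied from the figures above.
original = ("English original", 4329, 24028)
translations = [
    ("translation1", 3396, 22237),
    ("translation2", 3636, 23866),
    ("translation3", 3380, 22119),
]


def index(value: int, base: int) -> float:
    """Value expressed relative to a base of 100."""
    return 100.0 * value / base


rows = [
    (name, index(words, original[1]), index(chars, original[2]))
    for name, words, chars in translations
]

for name, word_index, char_index in rows:
    print(f"{name}: words {word_index:.1f}, characters {char_index:.1f}")

# Averaging over translators is one crude way to level out individual style.
mean_char_index = sum(row[2] for row in rows) / len(rows)
print(f"mean character index: {mean_char_index:.1f}")
```

All three translations come out below 100 on both measures, and the gap
is much larger in words than in characters, which is consistent with
Polish using fewer but longer words.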

Tadeusz Piotrowski


> -----Original Message-----
> From: owner-corpora at lists.uib.no
> [mailto:owner-corpora at lists.uib.no] On Behalf Of Yorick Wilks
> Sent: Thursday, April 25, 2002 5:56 PM
> To: James L. Fidelholtz
> Cc: spela vintar; Andrew Bredenkamp; CORPORA at HD.UIB.NO
> Subject: Re: Corpora: Relative text length
> 
> 
> 
> Isn't there some (minor) confusion here? If the question
> really is relative TEXT length, then nothing to do with word
> counts will settle it--what matters is character counts,
> since word length varies considerably between languages. The
> table showed 1984 in Estonian as having far fewer word tokens
> in it than the English original, but I'd bet they're much
> longer ones--how about the texts then? I have no parallel
> texts with English and E. European languages but I do with
> the four major W. European ones and the English pages are
> shorter in every case. Yorick Wilks
> 
> 
> 
> 
> 
> 
> "James L. Fidelholtz" wrote:
> 
> > Andrew and Spela:
> >         Just a word of caution: studies like Spela's provide
> > interesting and suggestive data, but figures will surely vary,
> > depending on the translator, topic, etc. [all the usual
> > sociolinguistic caveats apply here] (and note Jean's contribution,
> > with varying rates).  I was coauthor of a study comparing English and
> > Spanish, which basically tried to get Spanish to fit into the standard
> > readability curves in a fairly simple way.  We were only partially
> > successful (the counts were hand-done by yours truly, featuring a
> > variety of types of text, pseudo-randomly sampled, and especially
> > translations from one language to the other, as well as translations
> > from 3rd languages [French & German] into each).  To the best of my
> > recollection (I could look up the exact figures if anyone is hot for
> > them), our results for Spanish-English were rather close to Jean's for
> > French (I assume his were on large amounts of text done by
> > computer--if this holds up [not surprising, given the close
> > relationship of French and Spanish], it may indicate that, for this
> > kind of data, not such a huge amount of text is really necessary).
> >
> > On Wed, 24 Apr 2002, spela vintar wrote:
> >
> > >
> > >Hi Andrew,
> > >
> > >for Eastern-European languages you can compare the lengths of
> > >Orwell's 1984 and its translations that were collected within the
> > >Multext-East project. The original Multext project
> > >(http://www.lpl.univ-aix.fr/projects/multext/)
> > >should provide the same for English, German, French, Spanish etc.,
> > >however I wasn't able to find it on their homepage at first glance...
> > >
> > >Best,
> > >Spela
> > >
> > >http://nl.ijs.si/ME/CD/docs/mte-d21f/node8.html
> > >//////////////
> > >...
> > >Below we give an estimate for the number of words, by language. The
> > >wordcounts were produced by removing the SGML tags from the texts and
> > >then using a 'wc'-like procedure.
> > >
> > >  English      104.302
> > >  Romanian     101.460
> > >  Slovene       91.619
> > >  Bulgarian     87.235
> > >  Czech         80.366
> > >  Hungarian     81.147
> > >  Estonian      79.334
> > >
> > >
> > >Andrew Bredenkamp wrote:
> > >
> > >> Hello everyone,
> > >>
> > >> Does anyone know where I can find a list of relative text length?
> > >>
> > >> Taking one language as an index (100), I would like a list of the
> > >> (other) main European languages - e.g. (made up):
> > >>
> > >> Spanish: 100
> > >> English: 105
> > >> French: 110
> > >> German: 85
> > >>
> > >> ... etc.
> > >>
> > >> Thanks a lot in advance for any help you can give me.
> > >>
> > >> Cheers,
> > >> Andrew
> > >> =========================================
> > >> Andrew Bredenkamp
> > >> acrolinx GmbH
> > >> URL:            www.acrolinx.com
> > >>
> > >> =========================================
> > >
> > >
> > >
> >
> > --
> > James L. Fidelholtz                     e-mail: jfidel at siu.buap.mx
> > Posgrado en Ciencias del Lenguaje       tel.: +(52-2)229-5500 x5705
> > Instituto de Ciencias Sociales y Humanidades    fax: +(01-2) 229-5681
> > Benemérita Universidad Autónoma de Puebla, MÉXICO
> 
> 
