Corpora: Relative text length

David Horowitz dhorowitz at voxgeneration.com
Thu Apr 25 19:32:10 UTC 2002


Hello,

My name is David Horowitz. I am an electrical engineer by training and
have recently begun working in natural language.

I have a question:  Why is it interesting to know the average
word-length of a language or the number of characters in a book?
Forgive me, I don't know the literature and research hypotheses, and am
sincerely interested to know!

From my perspective as a speech scientist trained at MIT, written
language is the orthographic representation of speech.  Speech is highly
constrained by phonotactics, which are governed by the physiology of
the speech apparatus.  It seems to me that more insight into word length
can be gleaned by studying speech (physiology and acoustics) and the
phonology that models or governs the language's syllable structure.  I'd
be interested to know how computational linguists working on corpora can
statistically elucidate these constraints.  In my own research, I use
computational linguistics for NLP and speech recognition, but I build
hybrid systems that embody constraints such as these to help the
approach along.

Sincerely,

David Horowitz


-----Original Message-----
From: Tadeusz Piotrowski [mailto:tadpiotr at plusnet.pl]
Sent: 25 April 2002 19:23
To: 'Yorick Wilks'; corpora at lists.uib.no
Subject: RE: Corpora: Relative text length


Common sense would say that what Prof. Wilks says is right, and I do
believe he's right. This belief seems supported by the average word
length in English and Polish:

Average word length in characters:
5.92 Polish (corpus for frequency dictionary from the 60's)
4.26 English (LOB)
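
For readers who want to reproduce a figure of this kind, here is a minimal sketch of one way to compute average word length in characters. The message does not say how words were delimited in either corpus, so the tokenization below (maximal runs of Unicode letters) and the function name average_word_length are assumptions made for illustration only.

```python
import re

# Assumption: a "word" is a maximal run of Unicode letters.
# The original corpora may have used a different definition.
WORD_RE = re.compile(r"[^\W\d_]+")


def average_word_length(text: str) -> float:
    """Return the mean number of characters per word token in text."""
    words = WORD_RE.findall(text)
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


if __name__ == "__main__":
    # Tiny illustrative strings, not corpus data.
    print(average_word_length("This is a short text"))
    print(average_word_length("To jest krótki tekst"))
```

A different tokenization (for example, splitting on whitespace and keeping punctuation or digits) would give somewhat different averages, which is one reason figures from different corpora are hard to compare directly.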

Thus, it seems straightforward that a text in English should be shorter
than in Polish. Actually, it is very difficult to show this is really
so. One might want to use translations. Here are the results: 

One utility text:

Original English 
characters 95715 
words 14573

Translated Polish 
characters 100756 
words 15243 

So far so good.

But translators have their own individual style. To level that out, I
checked one English text with three Polish translations.

B. Singer On the wagon
Words 4329
Characters 24028

translation1
Words 3396
Characters 22237

translation2
Words 3636
Characters 23866

translation3
Words 3380
Characters 22119

And that is surprising: it is the English text that is longer.
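
The comparison above can be put in the index form asked for at the start of the thread (one language taken as 100). The sketch below uses only the word and character counts reported in this message; the names COUNTS and relative_indices are hypothetical, and choosing the English original as the baseline is an assumption.

```python
# (words, characters) as reported in the message above.
COUNTS = {
    "utility text": {
        "English original": (14573, 95715),
        "Polish translation": (15243, 100756),
    },
    "Singer": {
        "English original": (4329, 24028),
        "translation1": (3396, 22237),
        "translation2": (3636, 23866),
        "translation3": (3380, 22119),
    },
}


def relative_indices(texts, baseline="English original"):
    """Map each text to (word index, character index), baseline = 100."""
    base_words, base_chars = texts[baseline]
    return {
        name: (
            round(100.0 * words / base_words, 1),
            round(100.0 * chars / base_chars, 1),
        )
        for name, (words, chars) in texts.items()
    }


if __name__ == "__main__":
    for title, texts in COUNTS.items():
        print(title)
        for name, (w_idx, c_idx) in relative_indices(texts).items():
            print(f"  {name}: words {w_idx}, characters {c_idx}")
```

On these figures the Polish utility text comes out above 100 on both measures, while all three Polish translations of the Singer text come out below 100 on both.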

Tadeusz Piotrowski


> -----Original Message-----
> From: owner-corpora at lists.uib.no
> [mailto:owner-corpora at lists.uib.no] On Behalf Of Yorick Wilks
> Sent: Thursday, April 25, 2002 5:56 PM
> To: James L. Fidelholtz
> Cc: spela vintar; Andrew Bredenkamp; CORPORA at HD.UIB.NO
> Subject: Re: Corpora: Relative text length
> 
> 
> 
> Isn't there some (minor) confusion here? If the question
> really is relative TEXT length, then nothing to do with word
> counts will settle it--what matters is character counts,
> since word length varies considerably between languages. The
> table showed 1984 in Estonian as having far fewer word tokens
> in it than the English original, but I'd bet they're much
> longer ones--how about the texts then?? I have no parallel
> texts with English and E. European languages but I do with
> the four major W. European ones and the English pages are
> shorter in every case. Yorick Wilks
> 
> 
> 
> 
> 
> 
> "James L. Fidelholtz" wrote:
> 
> > Andrew and Spela:
> >         Just a word of caution: studies like Spela's provide
> > interesting and suggestive data, but figures will surely vary, 
> > depending on the translator, topic, etc. [all the usual 
> > sociolinguistic caveats apply here] (and note Jean's contribution, 
> > with varying rates).  I was coauthor of a study comparing English and
> > Spanish, which basically tried to get Spanish to fit into the standard
> > readability curves in a fairly simple way.  We were only partially
> > successful (the counts were hand-done by yours truly, featuring a
> > variety of types of text, pseudo-randomly sampled, and especially
> > translations from one language to the other, as well as translations
> > from 3rd languages [French & German] into each).  To the best of my
> > recollection (I could look up the exact figures if anyone is hot for
> > them), our results for Spanish-English were rather close to Jean's for
> > French (I assume his were on large amounts of text done by
> > computer--if this holds up [not surprising, given the close
> > relationship of French and Spanish], it may indicate that, for this
> > kind of data, not such a huge amount of text is really necessary).
> >
> > On Wed, 24 Apr 2002, spela vintar wrote:
> >
> > >
> > >Hi Andrew,
> > >
> > >for Eastern-European languages you can compare the lengths of
> > >Orwell's 1984 and its translations that were collected within the 
> > >Multext-East project. The original Multext project 
> > >(http://www.lpl.univ-aix.fr/projects/multext/)
> > >should provide the same for English, German, French, Spanish etc.,
> > >however I wasn't able to find it on their homepage at first glance...
> > >
> > >Best,
> > >Spela
> > >
> > >http://nl.ijs.si/ME/CD/docs/mte-d21f/node8.html
> > >//////////////
> > >...
> > >Below we give an estimate for the number of words, by language. The
> > >wordcounts were produced by removing the SGML tags from the texts and
> > >then using a 'wc'-like procedure.
> > >
> > >  English      104.302
> > >  Romanian     101.460
> > >  Slovene       91.619
> > >  Bulgarian     87.235
> > >  Czech         80.366
> > >  Hungarian     81.147
> > >  Estonian      79.334
> > >
> > >
> > >Andrew Bredenkamp wrote:
> > >
> > >> Hello everyone,
> > >>
> > >> Does anyone know where I can find a list of relative text length?
> > >>
> > >> Taking one language as an index (100), I would like a list of the
> > >> (other) main European languages - e.g. (made up):
> > >>
> > >> Spanish: 100
> > >> English: 105
> > >> French: 110
> > >> German: 85
> > >>
> > >> ... etc.
> > >>
> > >> Thanks a lot in advance for any help you can give me.
> > >>
> > >> Cheers,
> > >> Andrew
> > >> =========================================
> > >> Andrew Bredenkamp
> > >> acrolinx GmbH
> > >> URL:            www.acrolinx.com
> > >>
> > >> =========================================
> > >
> > >
> > >
> >
> > --
> > James L. Fidelholtz                     e-mail: jfidel at siu.buap.mx
> > Posgrado en Ciencias del Lenguaje       tel.: +(52-2)229-5500 x5705
> > Instituto de Ciencias Sociales y Humanidades    fax: +(01-2) 229-5681
> > Benemérita Universidad Autónoma de Puebla, MÉXICO
> 
> 






