Corpora: Relative text length

ramesh at ccl.bham.ac.uk
Fri Apr 26 10:26:14 UTC 2002


Dear Yorick

Would morpheme counts not be even more accurate (or
linguistically valid) than counting orthographic characters?
Unfortunately, I don't think anyone has done these yet...

Anyway, I agree that for the moment, character counts
are a useful addition to word counts.

Problems about translation (compensation, explication,
zero translation, etc) obviously apply throughout.

Here are some figures from my own research:

1. FIFA Laws in English, German, Spanish, and French.
French is longest, then Spanish, German, and English.

lines  words  characters   text

  726  10216       56874   Laws97GB.txt
  724   9173       63402   Laws97DE.txt
 1342  11030       63765   Laws97SP.txt
 1169  11763       67537   Laws97FR.txt
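
As a rough sketch, the character counts above can be turned into the kind of index the original question asked for, taking English as 100. The name `relative_index` is purely illustrative; the only data used are the figures in the table.

```python
# Character counts from the FIFA Laws table above.
char_counts = {
    "English": 56874,
    "German": 63402,
    "Spanish": 63765,
    "French": 67537,
}


def relative_index(counts, base):
    """Scale each count so that the base language equals 100."""
    base_count = counts[base]
    return {lang: round(100 * n / base_count, 1) for lang, n in counts.items()}


index = relative_index(char_counts, "English")
for lang, value in index.items():
    print(f"{lang:8s} {value:6.1f}")
```

On these figures German comes out at roughly 111, Spanish at 112 and French at 119, for this one text only; the translation caveats above still apply.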

2. Canadian Hansard in English and French.
French is longer in characters in both samples, though in the
second sample it has slightly fewer words than the English.

   lines   words   chars  text

    1569   20336  104015  c1.001.E.A
    1569   22413  124457  c1.002.F.A

    1120   12260   62421  c2.002.E.A
    1120   12135   62622  c2.003.F.A

3. George Orwell's 1984 (thanks to Multext-East and TELRI)
in several languages. These figures were provided by
Dr Tomaz Erjavec (Ljubljana) with various additional caveats:

            line    word   char

English    16053  102787  584803
Bulgarian  11172   85878  536977
Czech      11087   79022  498216
Estonian   17872   78792  545984
Hungarian   8813   79814  575219
Romanian   16684  103704  603868
Slovene    14938   91336  541461
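
A quick way to check whether the languages with fewer word tokens simply have longer words is to divide characters by words. The sketch below does this for the table above. It assumes the character counts come from a wc-like count and so probably include spaces and line ends, which means the ratios overstate true word length; they are only meant for comparison across the columns.

```python
# (words, characters) per language, from the 1984 table above.
counts = {
    "English":   (102787, 584803),
    "Bulgarian": (85878, 536977),
    "Czech":     (79022, 498216),
    "Estonian":  (78792, 545984),
    "Hungarian": (79814, 575219),
    "Romanian":  (103704, 603868),
    "Slovene":   (91336, 541461),
}


def chars_per_word(words, chars):
    """Average characters per word token (whitespace probably included)."""
    return chars / words


ratios = {lang: chars_per_word(w, c) for lang, (w, c) in counts.items()}

# Print from shortest to longest average word.
for lang, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{lang:10s} {ratio:.2f}")
```

On these figures English has the lowest ratio (about 5.7 characters per word) and Hungarian the highest (about 7.2). Estonian comes out at about 6.9, so its words are longer on average, although the Estonian text is still shorter than the English one in total characters.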

4. Le Monde Diplomatique in English and French:

     lines   words  characters   text

     116     956    6410         LEMAE1.txt
     133     941    7457         LEMAF1.txt

5. From research with Dr Maria Cristina Borba (Rio Grande, Brazil).
Alice in Wonderland in English, 2 Brazilian-Portuguese translations
(one for adults, one for children), and a Catalan translation (MARIST).

                         CARROLL       LEITE     SEVCENKO       MARIST

File length (bytes)      204,288     148,889      150,235      143,055

Running words (tokens)    31,731      25,348       26,245       25,566
Different words (types)    3,417       3,896        3,614        4,400
type/token ratio (mean)    44.99%      51.61%       51.25%       51.19%
ave. word length (letters)  3.63        4.36         4.31         4.16
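
A note on the type/token figures: the plain ratio for CARROLL would be 3,417/31,731, about 10.8%, so the "(mean)" values above were presumably averaged over fixed-size stretches of text rather than computed over the whole file. The sketch below shows one way such a mean could be computed. The chunk size of 1000 tokens, the function names and the naive whitespace tokenisation are all assumptions for illustration, not the settings used for the table.

```python
def plain_ttr(tokens):
    """Types divided by tokens, as a percentage (tokens must be non-empty)."""
    return 100 * len(set(tokens)) / len(tokens)


def mean_chunk_ttr(tokens, chunk_size=1000):
    """Average the type/token ratio over consecutive full chunks.

    chunk_size=1000 is an assumed default, not a figure from the table.
    A trailing partial chunk is ignored.
    """
    chunks = [
        tokens[i:i + chunk_size]
        for i in range(0, len(tokens) - chunk_size + 1, chunk_size)
    ]
    if not chunks:
        # Text shorter than one chunk: fall back to the plain ratio.
        return plain_ttr(tokens)
    return sum(plain_ttr(chunk) for chunk in chunks) / len(chunks)


# Toy example with a naive lowercase whitespace split.
tokens = "the cat sat on the mat and the dog sat on the log".lower().split()
print(f"plain TTR:      {plain_ttr(tokens):.2f}%")
print(f"mean chunk TTR: {mean_chunk_ttr(tokens, chunk_size=4):.2f}%")
```

Averaging over equal-sized chunks makes texts of different lengths comparable, since the plain ratio falls as a text gets longer.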

Best
Ramesh

Ramesh Krishnamurthy
Honorary Research Fellow, University of Birmingham;
Honorary Research Fellow, University of Wolverhampton;
Consultant, Cobuild and Bank of English Corpus, Collins Dictionaries.


On Thu, Apr 25, 2002 at 04:56:15PM +0100, Yorick Wilks wrote:
>
>
> Isn't there some (minor) confusion here? If the question really is relative
> TEXT length, then nothing to do with word counts will settle it--what matters
> is character counts, since word length varies considerably between languages.
> The table showed 1984 in Estonian as having far fewer word tokens in it than
> the English original, but I'd bet they're much longer ones--how about the
> texts then?? I have no parallel texts with English and E. European languages
> but I do with the four major W. European ones and the English pages are
> shorter in every case.
> Yorick Wilks
>
> "James L. Fidelholtz" wrote:
>
> > Andrew and Spela:
> >         Just a word of caution: studies like Spela's provide interesting
> > and suggestive data, but figures will surely vary, depending on the
> > translator, topic, etc. [all the usual sociolinguistic caveats apply
> > here] (and note Jean's contribution, with varying rates).  I was
> > coauthor of a study comparing English and Spanish, which basically tried
> > to get Spanish to fit into the standard readability curves in a fairly
> > simple way.  We were only partially successful (the counts were
> > hand-done by yours truly, featuring a variety of types of text,
> > pseudo-randomly sampled, and especially translations from one
> > language to the other, as well as translations from 3rd languages
> > [French & German] into each).  To the best of my recollection (I could
> > look up the exact figures if anyone is hot for them), our results for
> > Spanish-English were rather close to Jean's for French (I assume his
> > were on large amounts of text done by computer--if this holds up [not
> > surprising, given the close relationship of French and Spanish], it may
> > indicate that, for this kind of data, not such a huge amount of text is
> > really necessary).
> >
> > On Wed, 24 Apr 2002, spela vintar wrote:
> >
> > >
> > >Hi Andrew,
> > >
> > >for Eastern-European languages you can compare the lengths of Orwell's 1984
> > >and its translations that were collected within the Multext-East project.
> > >The original Multext project (http://www.lpl.univ-aix.fr/projects/multext/)
> > >should provide the same for English, German, French, Spanish etc., however I
> > >wasn't able to find it on their homepage at first glance...
> > >
> > >Best,
> > >Spela
> > >
> > >http://nl.ijs.si/ME/CD/docs/mte-d21f/node8.html
> > >//////////////
> > >...
> > >Below we give an estimate for the number of words, by language. The
> > >wordcounts were produced by removing the SGML tags from the texts and then
> > >using a 'wc'-like procedure.
> > >
> > >  English    104.302
> > >  Romanian   101.460
> > >  Slovene     91.619
> > >  Bulgarian   87.235
> > >  Czech       80.366
> > >  Hungarian   81.147
> > >  Estonian    79.334
> > >
> > >
> > >Andrew Bredenkamp wrote:
> > >
> > >> Hello everyone,
> > >>
> > >> Does anyone know where I can find a list of relative text length?
> > >>
> > >> Taking one language as an index (100), I would like a list of the (other)
> > >> main European languages - e.g. (made up):
> > >>
> > >> Spanish: 100
> > >> English: 105
> > >> French: 110
> > >> German: 85
> > >>
> > >> ... etc.
> > >>
> > >> Thanks a lot in advance for any help you can give me.
> > >>
> > >> Cheers,
> > >> Andrew
> > >> =========================================
> > >> Andrew Bredenkamp
> > >> acrolinx GmbH
> > >> URL:            www.acrolinx.com
> > >>
> > >>


