Corpora: Studies about proportion of words in languages ?

Jean Veronis Jean.Veronis at newsup.univ-mrs.fr
Tue Jun 6 14:26:44 UTC 2000


At 15:42 06/06/2000 +0200, Marcelo Sztrum wrote:
>Dear list members,
>
>Do you know of any comparative and/or quantitative studies of the
>proportion/ratio of words (in written corpora) in one language to
>another (e.g.: *** For every X (1000??) Spanish/English words, there are
>Y (about 700????) German words, etc.***)?

This was measured within the ARCADE project
(http://www.up.univ-mrs.fr/~veronis/arcade) for French/English.

The French-to-English ratio of word counts between corresponding segments
ranges from 1.08 to 1.16 depending on the text, French being the longer.
This was measured on a corpus of ca. 1.5 M words, manually aligned at the
sentence level.
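
For illustration, here is a minimal sketch of how such a ratio could be
computed from sentence-aligned text. It is not the ARCADE procedure: the
tokenizer, the function names (count_words, length_ratio) and the two
sentence pairs are all assumptions made up for this example, and the
figure it produces says nothing about real corpora.

```python
import re

# Assumed tokenizer: a word is a run of word characters, optionally joined
# by hyphens or apostrophes (so "l'homme" counts as one word). The
# tokenization actually used in ARCADE is not described in this message.
WORD = re.compile(r"\w+(?:[-']\w+)*")


def count_words(text):
    """Count the words in one segment under the assumed tokenizer."""
    return len(WORD.findall(text))


def length_ratio(aligned_pairs):
    """Return total French words divided by total English words.

    aligned_pairs is an iterable of (french_segment, english_segment)
    tuples, e.g. manually aligned sentences.
    """
    fr_total = 0
    en_total = 0
    for fr, en in aligned_pairs:
        fr_total += count_words(fr)
        en_total += count_words(en)
    if en_total == 0:
        raise ValueError("no English words in the aligned segments")
    return fr_total / en_total


# Invented toy data, for demonstration only.
pairs = [
    ("Le chat dort sur le canapé.", "The cat sleeps on the sofa."),
    ("Je ne sais pas où il est allé.", "I do not know where he went."),
]
print(round(length_ratio(pairs), 2))
```

The ratio is taken over total word counts rather than averaged per
sentence, so short segments do not dominate the result. The choice of
tokenizer (for instance how elisions such as "l'" are counted) shifts the
figure noticeably, so ratios are only comparable under the same word
definition.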

To appear soon (this summer):

Véronis, J. & Langlais, Ph. (2000). Evaluation of parallel text alignment
systems: The ARCADE project. In J. Véronis (Ed.), Parallel Text Processing:
Alignment and use of translation corpora (pp. 369-388). Dordrecht: Kluwer
Academic Publishers.

Jean Véronis
http://www.up.univ-mrs.fr/~veronis


PS: see also the bibliography on parallel texts at:
http://www.up.univ-mrs.fr/~veronis/biblios/ptp.html


