Corpora: Q: Text length differences in parallel text

Ralf Steinberger ralf.steinberger at jrc.it
Mon Oct 1 15:32:36 UTC 2001


Hello,

we are interested in finding out about the average text length difference
between texts and their translations (parallel texts). We would be
interested in data for all eleven official European Union languages, but
especially for the language pair English - Spanish. We want to use this (and
further) information to automatically identify translations of a given text
in a larger text collection.

Text length differences could be expressed either by using the number of
words or the number of characters. In our own sublanguage corpus, Spanish
texts use about 13% more characters than their English equivalents, but we
would like to have information pertaining to texts other than our own.
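
To make the two measures concrete, here is a minimal Python sketch of how
such a length ratio might be computed and used as a rough filter for
translation candidates. The function names, the whitespace-based word count
and the tolerance value are illustrative assumptions only; the default ratio
of 1.13 simply restates the figure from our own corpus and may not hold for
other text types.

```python
def length_ratio(source: str, target: str, unit: str = "chars") -> float:
    """Return target length divided by source length.

    unit="chars" counts every character, including spaces and punctuation;
    unit="words" counts whitespace-separated tokens (a crude assumption).
    """
    if unit == "chars":
        src_len, tgt_len = len(source), len(target)
    elif unit == "words":
        src_len, tgt_len = len(source.split()), len(target.split())
    else:
        raise ValueError("unit must be 'chars' or 'words'")
    if src_len == 0:
        raise ValueError("source text is empty")
    return tgt_len / src_len


def plausible_translation(source: str, target: str,
                          expected_ratio: float = 1.13,
                          tolerance: float = 0.10) -> bool:
    """Check whether the character ratio lies near the expected value.

    expected_ratio=1.13 reflects the English-to-Spanish figure quoted above;
    tolerance is a hypothetical threshold that would need tuning on real data.
    """
    ratio = length_ratio(source, target, unit="chars")
    return abs(ratio - expected_ratio) <= tolerance
```

A check like this could only narrow down the candidate set; it would have
to be combined with the further information mentioned above.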

Thanks in advance for any help with this. I shall send a summary of the
responses to the list.

Ralf


Ralf Steinberger
European Commission
Joint Research Centre - Ispra site (http://www.jrc.it/langtech/)


