Soft: ECDC-TM - A freely available translation memory in 25 languages

Thierry Hamon thierry.hamon at UNIV-PARIS13.FR
Sat Oct 20 13:46:28 UTC 2012

Date: Fri, 19 Oct 2012 12:03:01 +0200
From: Ralf Steinberger <ralf.steinberger at>
Message-id: <015501cdade0$eca55470$c5effd50$>

ECDC-TM is a translation memory (sentences and their manually produced
translations) in 25 languages. It is a multilingual parallel corpus
covering 300 language pairs.
Size: Up to 2500 translation units per language; 32,000 in total.
Languages: All 300 language pairs involving the following 25 languages:
Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek,
Finnish, French, Hungarian, Icelandic, Italian, Latvian, Lithuanian,
Maltese, Norwegian, Polish, Portuguese, Romanian, Slovak, Slovene,
Spanish, Swedish and Turkish.
Creator: European Centre for Disease Prevention and Control (ECDC) and JRC
ECDC-TM was produced by professionally translating the English language
web pages of the European Centre for Disease Prevention and Control
(ECDC), an EU agency based in Stockholm. The results of the translation
were stored in 24 bilingual translation memories. The JRC post-processed
these by cleaning the data and by producing one alignment for all 25
languages, resulting in parallel data for 300 language pairs.
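
As a rough illustration of how 24 English-centred bilingual memories can
yield data for 300 language pairs, the sketch below joins toy bilingual
memories on their shared English sentence and counts the unordered pairs
among 25 languages. The toy sentences, the function names and the
exact-match join on the English text are assumptions made for this
example; they are not a description of the JRC's actual post-processing.

```python
import math
from itertools import combinations


def pivot_align(bilingual_tms):
    """Merge English-centred bilingual memories into one multilingual table.

    bilingual_tms maps a language code to a list of
    (english_sentence, translation) tuples.
    Returns a dict: english_sentence -> {language_code: sentence}.
    """
    aligned = {}
    for lang, units in bilingual_tms.items():
        for english, translation in units:
            # Rows are keyed on the English sentence, which acts as the pivot.
            row = aligned.setdefault(english, {"en": english})
            row[lang] = translation
    return aligned


def language_pairs(languages):
    """Return every unordered pair of distinct languages."""
    return list(combinations(sorted(languages), 2))


# Hypothetical toy data: two bilingual memories sharing one English sentence.
toy_tms = {
    "fr": [("Cholera is an acute infection.",
            "Le choléra est une infection aiguë.")],
    "de": [("Cholera is an acute infection.",
            "Cholera ist eine akute Infektion.")],
}

aligned = pivot_align(toy_tms)
row = aligned["Cholera is an acute infection."]

# Any two languages present in a row form a parallel sentence pair,
# including pairs such as de-fr that were never translated directly.
for a, b in language_pairs(row.keys()):
    print(a, b, "|", row[a], "|", row[b])

# Unordered pairs among 25 languages: 25 * 24 / 2
print(math.comb(25, 2))
```

Pivoting on English means that a pair such as German-French is only as
good as the two underlying English alignments, which is one reason a
cleaning step before merging matters.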
Most of the documents cover health-related topics
(anthrax, botulism, cholera, dengue fever, hepatitis, etc.), but some of
the web pages also describe the ECDC itself (e.g. its
organisation, job opportunities) and its activities (e.g. epidemic
intelligence, surveillance).
The ECDC Translation Memory
is much smaller than the other multilingual resources distributed in the
past by the European Commission's Joint Research Centre (JRC). Its main
advantages are that (a) it covers even more languages and (b) it is based on
texts from a very different domain (Public Health).
The public data release is in line with the general effort of the
European Commission to support multilingualism, language diversity and
the re-use of Commission information. It follows the release of the
JRC-Acquis parallel corpus in 2006 (over 1 billion words in 22
languages), of the DGT-TM Translation Memory in 2007 and 2011, the
multilingual named entity resource JRC-Names in 2011, the multi-label
classification software JRC EuroVoc Indexer JEX in 22 languages and
further smaller multilingual resources. See for more information
on these resources.
ECDC-TM can be fed into translation memory software to support human
translators in their work. As it is a large parallel corpus in
electronic form, it can furthermore be used by specialists in
computational linguistics to train statistical machine translation
software, to generate multilingual dictionaries, to train and test
multilingual information extraction software, and more.
The JRC and collaborating services of the European Commission plan to
release further large-scale linguistic resources in the near
future. These include another 25-language translation memory and a
paragraph-aligned full-text parallel corpus in 23 languages.
Ralf Steinberger & Mohamed Ebrahim
European Commission - Joint Research Centre (JRC)
21027 Ispra (VA), Italy
URL - Applications:
URL - The science behind them: 

Message distributed via the Langage Naturel list <LN at>
Information, subscription:
English version       : 
Archives                 :

The LN list is sponsored by ATALA (Association pour le Traitement
Automatique des Langues)
Information and membership:
