Resources: EAC-TM, another freely available translation memory, in 26 languages

Thierry Hamon thierry.hamon at UNIV-PARIS13.FR
Wed Feb 6 08:57:54 UTC 2013

Date: Wed, 06 Feb 2013 01:25:47 +0100
From: Ralf Steinberger <ralf.steinberger at>
Message-id: <00df01ce0400$83389e40$89a9dac0$>

EAC-TM is a translation memory (sentences and their manually produced
translations) in 26 languages. It is a multilingual parallel corpus
covering 325 language pairs.

Size: Up to 5,100 translation units per language; 78,000 in total.

Languages: All 325 language pairs involving the following 26 languages:

  Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek,
  Finnish, French, Croatian, Hungarian, Icelandic, Italian, Latvian,
  Lithuanian, Maltese, Norwegian, Polish, Portuguese, Romanian, Slovak,
  Slovene, Spanish, Swedish and Turkish.
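
The figure of 325 is the number of unordered pairs that can be formed
from 26 languages (26 x 25 / 2). A minimal Python sketch of that count,
using the language names listed above (the variable names are
illustrative only):

```python
from itertools import combinations

LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Croatian", "Hungarian",
    "Icelandic", "Italian", "Latvian", "Lithuanian", "Maltese",
    "Norwegian", "Polish", "Portuguese", "Romanian", "Slovak",
    "Slovene", "Spanish", "Swedish", "Turkish",
]

# Unordered pairs: (Czech, Danish) and (Danish, Czech) count once.
language_pairs = list(combinations(LANGUAGES, 2))

print(len(LANGUAGES))       # 26
print(len(language_pairs))  # 325
```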


Creator: EC Directorate for Education and Culture (EAC) and JRC


EAC-TM was produced by translating the English language form data for
the EAC’s Lifelong Learning Programme (LLP) and the Youth in Action
Programme of the European Commission’s Directorate General for Education
and Culture (EAC). The results of the translation were stored in 25
bilingual translation memories. DG EAC and the JRC post-processed these
by cleaning the data and by producing one alignment for all 26
languages, resulting in parallel data for 325 language pairs.
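
The announcement does not describe the alignment procedure in detail.
One simple way to picture it, assuming that English is the shared side
of every bilingual memory and that units are matched on identical
English text, is sketched below. The function names and the toy data are
hypothetical and are not taken from EAC-TM or from the JRC's tooling.

```python
from collections import defaultdict


def merge_via_pivot(bilingual_tms, pivot="en"):
    """Merge {lang: [(pivot_text, target_text), ...]} into
    {pivot_text: {lang: text, ...}}, keyed on the shared pivot sentence."""
    merged = defaultdict(dict)
    for lang, units in bilingual_tms.items():
        for pivot_text, target_text in units:
            merged[pivot_text][pivot] = pivot_text
            merged[pivot_text][lang] = target_text
    return dict(merged)


def extract_pair(merged, src, tgt):
    """Return (src, tgt) sentence pairs for units that have both languages."""
    return [(u[src], u[tgt]) for u in merged.values() if src in u and tgt in u]


# Toy data invented for illustration only.
toy = {
    "fr": [("First name", "Prénom"), ("Country", "Pays")],
    "de": [("First name", "Vorname"), ("Country", "Land")],
}
merged = merge_via_pivot(toy)
print(extract_pair(merged, "fr", "de"))
```

Once every unit carries all of its available translations, any of the
325 pairs (here French-German) can be read off without going through
English again. Real data would also need the cleaning step mentioned
above, which this sketch leaves out.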

The underlying documents are thus form data in the field of education
and culture.

The EAC Translation Memory is much smaller than the
other multilingual resources distributed in the past by the European
Commission’s Joint Research Centre (JRC). Its main advantages are that
(a) it covers even more languages and (b) it is based on texts from a
very different domain (education and culture).


The public data release is in line with the general effort of the
European Commission to support multilingualism, language diversity and
the re-use of Commission information. It follows the release of the
JRC-Acquis parallel corpus in 2006 (over 1 billion words in 22
languages), of the DGT-TM Translation Memory in 2007 and 2011, of the
multilingual named entity resource JRC-Names in 2011, of the multi-label
classification software JRC EuroVoc Indexer (JEX) in 22 languages in
2012, of the ECDC-TM Translation Memory in 25 languages in 2012, of the
DGT-Acquis parallel corpus in 23 languages in 2012, and of further
smaller multilingual resources. See for more information on these


EAC-TM can be fed into translation memory software to support human
translators in their work. As it is a large parallel corpus in
electronic form, it can furthermore be used by specialists in
computational linguistics to train statistical machine translation
software, to generate multilingual dictionaries, to train and test
multilingual information extraction software, and more.
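
As a rough illustration of the machine translation use case, the sketch
below pulls sentence pairs for one language pair out of a translation
memory file. It assumes the data is stored in TMX (the XML-based
Translation Memory eXchange format), which this announcement does not
state, and that languages are marked with short codes such as "fr" and
"de"; the actual files may use different codes. The function name and
the sample unit are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_pairs(tmx_source, src, tgt):
    """Yield (src_text, tgt_text) for each <tu> containing both languages."""
    root = ET.parse(tmx_source).getroot()
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if lang and seg is not None:
                segs[lang] = "".join(seg.itertext()).strip()
        if src in segs and tgt in segs:
            yield segs[src], segs[tgt]


# Invented sample unit, used only to exercise the function.
SAMPLE = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Date of birth</seg></tuv>
      <tuv xml:lang="fr"><seg>Date de naissance</seg></tuv>
      <tuv xml:lang="de"><seg>Geburtsdatum</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = list(read_pairs(io.StringIO(SAMPLE), "fr", "de"))
print(pairs)
```

The resulting list of pairs is the usual starting point for writing the
two line-aligned text files that statistical machine translation
toolkits typically expect.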


The JRC and collaborating services of the European Commission hope to
release further large-scale linguistic resources in the future.

Ralf Steinberger & Mohamed Ebrahim
European Commission - Joint Research Centre (JRC)
21027 Ispra (VA), Italy

URL – Applications:

URL – Publications on the science behind them:

Message distributed by the Langage Naturel list <LN at>

The LN list is sponsored by ATALA (Association pour le Traitement
Automatique des Langues)