Resources: Release 2014 of DGT-TM (parallel corpus in 24 languages)

Sat Sep 20 20:19:57 UTC 2014

Date: Thu, 18 Sep 2014 15:35:16 +0200
From: Ralf Steinberger <ralf.steinberger at jrc.ec.europa.eu>
Message-id: <001e01cfd345$61db2630$25917290$@jrc.ec.europa.eu>
X-url: https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory
X-url: http://ec.europa.eu/dgs/translation/index_en.htm
X-url: http://europa.eu/abc/eurojargon/index_en.htm
X-url: https://ec.europa.eu/jrc/en/language-technologies

Readers on this list may be interested to hear that the 2014 release of
the DGT-Translation Memory is now available for download.

DGT-TM is an extraction of the translation memory of the European
Institutions for all official EU languages, produced by the European
Commission’s Directorate General for Translation (DGT) and distributed
by the Joint Research Centre (JRC). A translation memory is a collection
of sentences paired with their manually produced translations.

The new release is called DGT-TM-2014. It follows the previous releases,
DGT-TM (2007), DGT-TM-2011, DGT-TM-2012 and DGT-TM-2013. DGT-TM-2014
adds over eleven million translation units to the previous 73 million
translation units, resulting in almost 85 million translation units in
total (almost 1.4 billion words).

New features of DGT-TM-2014 are:

- Croatian (HR) data is made available for the first time.

- Significantly more data for languages that previously had less
  coverage (e.g. Bulgarian, Irish, Maltese, Romanian).

- About 500K new translation units for most languages.

- Most documents of this release were translated in 2013, but it also
  contains previously unpublished documents from older years.

- More data for language pairs involving Maltese are available on
  request.

Languages: All 276 language pairs involving the following 24 languages:

Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, German,
Greek, Finnish, French, Irish, Hungarian, Italian, Latvian, Lithuanian,
Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish and
Swedish.
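
As a quick check of the figure above, 24 languages yield 24 x 23 / 2 = 276
unordered language pairs. The short sketch below enumerates them; the
variable names are illustrative only and are not part of the DGT-TM
distribution.

```python
from itertools import combinations

# The 24 languages listed in this announcement.
LANGUAGES = [
    "Bulgarian", "Croatian", "Czech", "Danish", "Dutch", "English",
    "Estonian", "German", "Greek", "Finnish", "French", "Irish",
    "Hungarian", "Italian", "Latvian", "Lithuanian", "Maltese", "Polish",
    "Portuguese", "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]

# Unordered pairs: (English, French) is counted once, not twice.
PAIRS = list(combinations(LANGUAGES, 2))

print(len(LANGUAGES))  # 24
print(len(PAIRS))      # 276
```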

URL:
https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory   

Creator: European Commission - Directorate General for Translation
(DGT, http://ec.europa.eu/dgs/translation/index_en.htm)

WHAT IS DGT-TM

The ‘Acquis Communautaire’
(http://europa.eu/abc/eurojargon/index_en.htm) is the entire body of
European legislation, comprising all the treaties, regulations and
directives adopted by the European Union (EU). Since each new country
joining the EU is required to accept the whole Acquis Communautaire,
this body of legislation has been translated into 23 official
languages. For the 24th official EU language, Irish, the Acquis has not
been translated on a regular basis, which is why DGT-TM includes less
data in Irish. The Acquis Communautaire was split into sentences and
aligned automatically at sentence level, resulting in the DGT
translation memory, DGT-TM. The text data is accompanied by software
that allows users to extract all sentences and their translations for
any of the 276 possible language pairs.
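
The extraction software shipped with DGT-TM is not reproduced here. As a
rough illustration of what such an extraction involves, the sketch below
assumes that the data is stored as TMX (Translation Memory eXchange)
files, which this announcement does not state explicitly. The sample
document, the function names and the two-letter language matching are
all hypothetical; the sample sentences are made up and are not taken
from DGT-TM.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up miniature TMX document, for illustration only.
SAMPLE_TMX = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="EN-GB"><seg>This Regulation shall enter into force.</seg></tuv>
      <tuv xml:lang="FR-FR"><seg>Le présent règlement entre en vigueur.</seg></tuv>
      <tuv xml:lang="DE-DE"><seg>Diese Verordnung tritt in Kraft.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="EN-GB"><seg>Article 1</seg></tuv>
      <tuv xml:lang="DE-DE"><seg>Artikel 1</seg></tuv>
    </tu>
  </body>
</tmx>"""


def tuv_language(tuv):
    """Return the two-letter language code of a <tuv> element, upper-cased."""
    # Assumption: accept either xml:lang or a plain lang attribute.
    code = tuv.get(XML_LANG) or tuv.get("lang") or ""
    return code.upper()[:2]


def extract_pairs(tmx_text, source, target):
    """Collect (source, target) segment pairs from every translation unit
    that contains both languages; units missing either one are skipped."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segments[tuv_language(tuv)] = "".join(seg.itertext()).strip()
        if source in segments and target in segments:
            pairs.append((segments[source], segments[target]))
    return pairs


for src_text, tgt_text in extract_pairs(SAMPLE_TMX, "EN", "DE"):
    print(src_text, "|||", tgt_text)
```

Because a translation unit does not necessarily contain every language,
the number of pairs returned differs from one language pair to another:
in the sample above, EN-DE yields two pairs while EN-FR yields one.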

MOTIVATION FOR THIS RELEASE

The public data release is in line with the general effort of the
European Commission to support multilingualism, language diversity and
the re-use of Commission information. It follows the release of a number
of other multilingual data sets:

- the JRC-Acquis parallel corpus in 2006 (over 1 billion words in 22
  languages),

- the DGT-TM Translation Memory in 2007, 

- the multilingual named entity resource JRC-Names in 2011, 

- the multilingual multi-label classification tool (and accompanying
  text data) JRC EuroVoc Indexer (JEX) (22 languages) in 2012,

- the ECDC-TM Translation Memory in 2012 (domain: Public Health),

- the DGT-Acquis parallel corpus in 2012,

- the EAC-TM Translation Memory in 2013 (domain: Education and Culture),

- and further smaller multilingual resources. 

See https://ec.europa.eu/jrc/en/language-technologies for more
information on these resources.

WHAT DGT-TM CAN BE USED FOR

DGT-TM can be fed into translation memory software to support human
translators in their work. As it is a large parallel corpus in
electronic form, it can furthermore be used by specialists in
computational linguistics to train statistical machine translation
software, to generate multilingual dictionaries, to train and test
multilingual information extraction software, and more.

MORE INFORMATION ON DGT-TM 

Publications on the JRC’s multilingual language technology activities
are listed at http://langtech.jrc.ec.europa.eu/JRC_Publications.html .
For details specifically on DGT-TM, see:

      Steinberger Ralf, Andreas Eisele, Szymon Klocek, Spyridon Pilos &
      Patrick Schlüter (2012).  DGT-TM: A freely Available Translation
      Memory in 22 Languages.  Proceedings of the 8th international
      conference on Language Resources and Evaluation (LREC'2012),
      Istanbul, 21-27 May 2012.
      (http://langtech.jrc.ec.europa.eu/Documents/2012_LREC_DGT-TM_Final.pdf)

The following recent article compares all freely available Language
Technology resources distributed by the JRC and provides comparative
background information:

     Steinberger Ralf, Mohamed Ebrahim, Alexandros Poulis, Manuel
     Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski & Signe
     Gilbro (2014).  An overview of the European Union's highly
     multilingual parallel corpora.  Language Resources and Evaluation
     Journal (LRE).  DOI: 10.1007/s10579-014-9277-0.
     (http://langtech.jrc.ec.europa.eu/Documents/2014_08_LRE-Journal_JRC-Linguistic-Resources_Manuscript.pdf,
     http://link.springer.com/article/10.1007/s10579-014-9277-0)

WHAT NEXT?

The release of DCEP (Digital Corpus of the European Parliament,
http://www.lrec-conf.org/proceedings/lrec2014/pdf/943_Paper.pdf), a very
large new parallel corpus, is pending.

Ralf Steinberger (http://langtech.jrc.ec.europa.eu/RS.html)
European Commission - Joint Research Centre (JRC)
21027 Ispra (VA), Italy

URL – Applications: http://emm.newsbrief.eu/overview.html

URL – Resources: https://ec.europa.eu/jrc/en/language-technologies

URL – Publications:
http://langtech.jrc.ec.europa.eu/JRC_Publications.html
