Resource: New release of DGT-TM (parallel corpus in 23 languages)

Thierry Hamon thierry.hamon at UNIV-PARIS13.FR
Fri Nov 9 20:57:45 UTC 2012


Date: Mon, 05 Nov 2012 17:04:04 +0100
From: Ralf Steinberger <ralf.steinberger at jrc.ec.europa.eu>
Message-id: <009f01cdbb6f$2dca2980$895e7c80$@jrc.ec.europa.eu>
X-url: http://langtech.jrc.ec.europa.eu/DGT-TM.html

DGT-TM is an extraction of the translation memory of the European
Institutions for all official EU languages, produced by the European
Commission’s Directorate General for Translation (DGT) and distributed
by the Joint Research Centre (JRC). A translation memory is a collection
of sentences paired with their manually produced translations.
 
The new release is called DGT-TM-2012. It follows the previous releases,
DGT-TM (2007) and DGT-TM-2011. DGT-TM-2012 adds over six million
translation units to the previous 57 million, resulting in almost
3.3 million sentences for most languages and 63 million translation
units in total.
 
New features of DGT-TM-2012 are:
 
· A small amount of Irish data is now included for the first time;
· Significantly more data for the Bulgarian, Maltese and Romanian
  languages;
· About 285K new translation units for most languages.
 
Languages: All 253 language pairs involving the following 23 languages:
 
Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek,
Finnish, French, Irish, Hungarian, Italian, Latvian, Lithuanian,
Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish and
Swedish.
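
The figure of 253 language pairs follows from counting unordered pairs
of the 23 languages listed above (23 x 22 / 2). A minimal Python sketch
of that count; the variable names are illustrative only and are not part
of the DGT-TM distribution:

```python
from itertools import combinations

# The 23 languages named in the announcement.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Irish", "Hungarian",
    "Italian", "Latvian", "Lithuanian", "Maltese", "Polish",
    "Portuguese", "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]

# Unordered pairs: English-French and French-English count once.
language_pairs = list(combinations(LANGUAGES, 2))

print(len(LANGUAGES), "languages,", len(language_pairs), "language pairs")
```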
            
URL:        http://langtech.jrc.ec.europa.eu/DGT-TM.html
Creator:    European Commission - Directorate General for Translation
            (DGT, http://ec.europa.eu/dgs/translation/index_en.htm)
 
 
WHAT IS DGT-TM
 
The ‘Acquis Communautaire’
(http://europa.eu/abc/eurojargon/index_en.htm) is the entire body of
European legislation, comprising all the treaties, regulations and
directives adopted by the European Union (EU). Since each new country
joining the EU is required to accept the whole Acquis Communautaire,
this body of legislation has been translated into 22 official
languages. For the 23rd official EU language, Irish, the Acquis has not
been translated on a regular basis, which is why DGT-TM includes only a
small amount of Irish data. The Acquis Communautaire was split into
sentences and aligned automatically at sentence level, resulting in the
DGT translation memory, DGT-TM. The text data is accompanied by software
that allows users to extract all sentences and their translations for
any of the 253 possible language pairs.
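
To illustrate what extracting a language pair from a translation memory
involves, here is a rough Python sketch. It is not the software shipped
with DGT-TM. It assumes the data is stored as TMX (the common XML
exchange format for translation memories), which this announcement does
not state. The sample document, the language codes such as "EN-GB" and
the function names are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample, not taken from DGT-TM itself.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="EN-GB" datatype="plaintext" segtype="sentence"
          adminlang="EN-GB" creationtool="example"
          creationtoolversion="1" o-tmf="example"/>
  <body>
    <tu>
      <tuv xml:lang="EN-GB"><seg>Article 1</seg></tuv>
      <tuv xml:lang="FR-FR"><seg>Article premier</seg></tuv>
      <tuv xml:lang="DE-DE"><seg>Artikel 1</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="EN-GB"><seg>This Regulation shall enter into force.</seg></tuv>
      <tuv xml:lang="DE-DE"><seg>Diese Verordnung tritt in Kraft.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def tuv_language(tuv):
    """Return the two-letter language code of a <tuv> element, upper-cased."""
    # Accept both xml:lang and a plain lang attribute (an assumption).
    code = tuv.get(XML_LANG) or tuv.get("lang") or ""
    return code.split("-")[0].upper()


def extract_pairs(tmx_text, source, target):
    """Collect (source, target) sentence pairs from TMX text.

    Translation units lacking either language are skipped.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segments[tuv_language(tuv)] = "".join(seg.itertext()).strip()
        if source in segments and target in segments:
            pairs.append((segments[source], segments[target]))
    return pairs


for src, tgt in extract_pairs(SAMPLE_TMX, "EN", "DE"):
    print(src, "|||", tgt)
```

The second translation unit has no French version, so asking for the
English-French pair would return only the first unit. This is one reason
why the number of sentence pairs can differ from one language pair to
another.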
 
MOTIVATION FOR THIS RELEASE
 
The public data release is in line with the general effort of the
European Commission to support multilingualism, language diversity and
the re-use of Commission information. It follows the release of the
JRC-Acquis parallel corpus in 2006 (over 1 billion words in 22
languages), of the DGT-TM Translation Memory in 2007, the multilingual
named entity resource JRC-Names in 2011, the multilingual multi-label
classification tool (and accompanying text data) JRC EuroVoc Indexer
(JEX) (22 languages), and further smaller multilingual resources. See
http://langtech.jrc.ec.europa.eu/JRC_Resources.html for more information
on these resources.
 
WHAT DGT-TM CAN BE USED FOR
                
DGT-TM can be fed into translation memory software to support human
translators in their work. As it is a large parallel corpus in
electronic form, it can furthermore be used by specialists in
computational linguistics to train statistical machine translation
software, to generate multilingual dictionaries, to train and test
multilingual information extraction software, and more.
 
MORE INFORMATION ON DGT-TM 

See http://langtech.jrc.ec.europa.eu/JRC_Publications.html for
detailed publications on the JRC’s multilingual language technology
activity. For details on DGT-TM, you can read:
 
Steinberger Ralf, Andreas Eisele, Szymon Klocek, Spyridon Pilos
& Patrick Schlüter (2012).
DGT-TM: A freely Available Translation Memory in 22 Languages.
Proceedings of the 8th International Conference on Language
Resources and Evaluation (LREC'2012), Istanbul, 21-27 May 2012.
http://langtech.jrc.ec.europa.eu/Documents/2012_LREC_DGT-TM_Final.pdf
 
WHAT NEXT?
 
The JRC and collaborating services of the European Commission are
currently finalising the release of further large-scale linguistic
resources.
 
 
Ralf Steinberger http://langtech.jrc.ec.europa.eu/RS.html
European Commission - Joint Research Centre (JRC)
21027 Ispra (VA), Italy
URL – Applications:  http://emm.newsbrief.eu/overview.html 
URL – Resources: http://ipsc.jrc.ec.europa.eu/index.php?id=61  



