[Corpora-List] DGT-TM - A freely available large-scale translation memory in 22 languages

Fri Apr 13 13:47:26 UTC 2012

DGT-TM is a translation memory (sentences and their manually produced
translations) in 22 languages. 

Size:       About 3 million sentences for most languages, 57 million in
total.

Languages:  All 231 language pairs involving the following 22 languages:

            Bulgarian, Czech, Danish, Dutch, English, Estonian, German,
            Greek, Finnish, French, Hungarian, Italian, Latvian,
            Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak,
            Slovene, Spanish and Swedish.

URL:        http://langtech.jrc.ec.europa.eu/DGT-TM.html
Creator:    European Commission - Directorate General for Translation (DGT),
            http://ec.europa.eu/dgs/translation/index_en.htm

The first version of DGT-TM (19 million sentences, or ‘Translation Units’)
was released in 2007. The new release triples the size of the collection by
adding a further 38 million sentences. New data is planned to be released
annually from now on.

WHAT IS DGT-TM

The ‘Acquis Communautaire’ (http://europa.eu/abc/eurojargon/index_en.htm)
is the entire body of European legislation, comprising all the treaties,
regulations and directives adopted by the European Union (EU). Since each
new country joining the EU is required to accept the whole Acquis
Communautaire, this body of legislation has been translated into 22 official
languages. For the 23rd official EU language, Irish, the Acquis is not
translated on a regular basis, which is why DGT-TM does not include data in
Irish. The Acquis Communautaire was split into sentences and aligned
automatically at sentence level, resulting in the DGT translation memory,
DGT-TM. The text data is accompanied by software that extracts all
sentences and their translations for any of the 231 possible language
pairs.
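
The figure of 231 follows from counting unordered pairs among the 22
languages (22 x 21 / 2). The short Python sketch below only illustrates that
count; it is not part of the software distributed with DGT-TM, and the
variable names are chosen here for illustration.

```python
from itertools import combinations

# The 22 languages listed in the announcement.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Hungarian", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]

# Unordered pairs: each pair is counted once, regardless of direction.
PAIRS = list(combinations(LANGUAGES, 2))

print(len(LANGUAGES))  # 22
print(len(PAIRS))      # 231
```

Counting each translation direction separately would give twice that
number, 462.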

MOTIVATION FOR THIS RELEASE

The public data release is in line with the general effort of the European
Commission to support multilingualism, language diversity and the re-use of
Commission information. It follows the release of the JRC-Acquis parallel
corpus in 2006 (over 1 billion words in 22 languages), the first version of
the DGT-TM translation memory in 2007, the multilingual named entity
resource JRC-Names in 2011, and further smaller multilingual resources. See
http://langtech.jrc.ec.europa.eu/JRC_Resources.html for more information on
these resources.

WHAT DGT-TM CAN BE USED FOR

DGT-TM can be fed into translation memory software to support human
translators in their work. As it is a large parallel corpus in electronic
form, it can furthermore be used by specialists in computational linguistics
to train statistical machine translation software, to generate multilingual
dictionaries, to train and test multilingual information extraction
software, and more.
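
As a rough illustration of the first step in such work, the Python sketch
below pulls the sentence pairs for one language pair out of a small
translation memory. It assumes the data is stored in a TMX-style XML layout
(translation units holding one segment per language), which this
announcement does not specify. The sample sentences and the function name
extract_pairs are made up for the example and are not taken from DGT-TM or
its accompanying software.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this namespaced key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up sample in a TMX-style layout (assumption: not real DGT-TM data).
SAMPLE_TMX = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="EN"><seg>This Regulation shall enter into force.</seg></tuv>
      <tuv xml:lang="FR"><seg>Le présent règlement entre en vigueur.</seg></tuv>
      <tuv xml:lang="DE"><seg>Diese Verordnung tritt in Kraft.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="EN"><seg>Done at Brussels.</seg></tuv>
      <tuv xml:lang="DE"><seg>Geschehen zu Brüssel.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def extract_pairs(tmx_text, source_lang, target_lang):
    """Return (source, target) sentence pairs for one language pair.

    Translation units that lack either language are skipped.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        # Map each language code in this unit to its segment text.
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segments[tuv.get(XML_LANG, "").upper()] = seg.text.strip()
        if source_lang in segments and target_lang in segments:
            pairs.append((segments[source_lang], segments[target_lang]))
    return pairs


for source, target in extract_pairs(SAMPLE_TMX, "EN", "DE"):
    print(source, "|", target)
```

Pairs obtained this way could then be loaded into translation memory
software or written out as training data for a machine translation system.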

MORE INFORMATION ON DGT-TM 

At http://langtech.jrc.ec.europa.eu/ you can find more information on the
JRC’s multilingual language technology activity, download links for DGT-TM,
and a page pointing to other multilingual resources. For details on DGT-TM,
see:

      Steinberger Ralf, Andreas Eisele, Szymon Klocek, Spyridon Pilos
      & Patrick Schlüter (2012). DGT-TM: A freely Available Translation
      Memory in 22 Languages. Proceedings of the 8th International
      Conference on Language Resources and Evaluation (LREC'2012),
      Istanbul, 21-27 May 2012.
      http://langtech.jrc.ec.europa.eu/Documents/2012_LREC_DGT-TM_Final.pdf

WHAT NEXT?

The JRC and collaborating services of the European Commission plan to
release further large-scale linguistic resources in the near future. JEX,
the JRC EuroVoc Indexer software, which automatically assigns multiple
labels to documents according to the large-scale subject domain
classification scheme EuroVoc, will be released in May 2012.

Ralf Steinberger  
European Commission - Joint Research Centre (JRC)
21027 Ispra (VA), Italy
URL – Applications:  http://emm.newsbrief.eu/overview.html
URL – The science behind them:  http://langtech.jrc.ec.europa.eu/
