Soft: ECDC-TM - A freely available translation memory in 25 languages - CORRECTION

Tue Oct 23 19:43:11 UTC 2012

Date: Mon, 22 Oct 2012 11:11:41 +0200
From: Ralf Steinberger <ralf.steinberger at jrc.ec.europa.eu>
Message-id: <04f301cdb035$403fca10$c0bf5e30$@jrc.ec.europa.eu>
X-url: http://www.ecdc.europa.eu/

This is a correction to the announcement sent on Friday 19 October. The
email and the related ECDC-TM webpage contained incorrect information
about the languages covered and about the corpus statistics. Thanks to
Raivis Skadiņš for pointing this out. I had mixed up the information
for two different corpora. 25 languages seems to be more than my little
brain can handle. ;-) Please accept my apologies.

You will find the corrected summary below. The web page has also been
corrected.

Ralf

==========

ECDC-TM is a translation memory (sentences and their manually produced
translations) in 25 languages. It is a multilingual parallel corpus
covering 300 language pairs.

Size: Up to 3900 translation units per language; 64,000 in total.

Languages: All 300 language pairs involving the following 25 languages:

Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek,
Finnish, French, Irish, Hungarian, Icelandic, Italian, Latvian,
Lithuanian, Maltese, Norwegian, Polish, Portuguese, Romanian, Slovak,
Slovene, Spanish and Swedish.

URL: http://langtech.jrc.ec.europa.eu/ECDC-TM.html

Creator: European Centre for Disease Prevention and Control (ECDC,
http://www.ecdc.europa.eu/) and JRC

WHAT IS ECDC-TM

ECDC-TM was produced by professionally translating the English-language
web pages of the European Centre for Disease Prevention and Control
(ECDC), an EU agency based in Stockholm. The results of the translation
were stored in 24 bilingual translation memories. The JRC post-processed
these by cleaning the data and by producing one alignment for all 25
languages, resulting in parallel data for 300 language pairs.
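
To illustrate how 24 bilingual memories can yield data for 300 language
pairs, here is a minimal sketch in Python. It assumes that units are
joined on an identical English source segment; the actual JRC
post-processing is not described in this announcement and may well
differ. The function names and the toy sentences are hypothetical and
are not taken from ECDC-TM.

```python
from itertools import combinations


def merge_bilingual_memories(memories):
    """Merge {lang: [(english, translation), ...]} into one row per English segment.

    Assumption: two units belong together when their English text is identical.
    """
    aligned = {}
    for lang, units in memories.items():
        for english, translation in units:
            row = aligned.setdefault(english, {"en": english})
            row[lang] = translation
    return list(aligned.values())


def extract_pair(aligned, lang_a, lang_b):
    """Return the units that exist in both languages of one language pair."""
    return [
        (row[lang_a], row[lang_b])
        for row in aligned
        if lang_a in row and lang_b in row
    ]


# Toy data, invented for illustration only.
memories = {
    "de": [("Cholera is an acute infection.", "Cholera ist eine akute Infektion.")],
    "fr": [("Cholera is an acute infection.", "Le choléra est une infection aiguë.")],
}

aligned = merge_bilingual_memories(memories)
german_french = extract_pair(aligned, "de", "fr")

# 25 languages give 25 * 24 / 2 unordered language pairs.
pair_count = len(list(combinations(range(25), 2)))
```

Pivoting through English in this way means that a pair such as
German-French is only as complete as the overlap between the two
underlying bilingual memories.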

Most of the documents deal with health-related topics (anthrax,
botulism, cholera, dengue fever, hepatitis, etc.), but some of the web
pages also describe ECDC itself (e.g. its organisation, job
opportunities) and its activities (e.g. epidemic intelligence,
surveillance).

The ECDC Translation Memory
(http://langtech.jrc.ec.europa.eu/ECDC-TM.html) is much smaller than the
other multilingual resources distributed in the past by the European
Commission’s Joint Research Centre (JRC). Its main advantages are that
(a) it covers even more languages and (b) it is based on texts from a
very different domain (Public Health).

MOTIVATION FOR THIS RELEASE

The public data release is in line with the general effort of the
European Commission to support multilingualism, language diversity and
the re-use of Commission information. It follows the release of the
JRC-Acquis parallel corpus
(http://langtech.jrc.ec.europa.eu/JRC-Acquis.html) in 2006 (over 1
billion words in 22 languages), the DGT-TM translation memory
(http://langtech.jrc.ec.europa.eu/DGT-TM.html) in 2007 and 2011, the
multilingual named entity resource JRC-Names
(http://langtech.jrc.ec.europa.eu/JRC-Names.html) in 2011, the
multi-label classification software JRC EuroVoc Indexer JEX
(http://langtech.jrc.ec.europa.eu/Eurovoc.html) in 22 languages, and
further smaller multilingual resources. See
http://langtech.jrc.ec.europa.eu/JRC_Resources.html for more information
on these resources.

WHAT ECDC-TM CAN BE USED FOR

ECDC-TM can be fed into translation memory software to support human
translators in their work. As it is a parallel corpus in electronic
form, it can also be used by specialists in computational linguistics
to train statistical machine translation software, to generate
multilingual dictionaries, to train and test multilingual information
extraction software, and more.

WHAT NEXT?

The JRC and collaborating services of the European Commission plan to
release further large-scale linguistic resources in the near
future. These include another 25-language translation memory and a
paragraph-aligned full-text parallel corpus in 23 languages.

Ralf Steinberger & Mohamed Ebrahim
European Commission - Joint Research Centre (JRC)
21027 Ispra (VA), Italy
URL – Applications: http://emm.newsbrief.eu/overview.html
URL – The science behind them: http://langtech.jrc.ec.europa.eu/

-------------------------------------------------------------------------
Message distributed by the Langage Naturel list <LN at cines.fr>
Information, subscription: http://www.atala.org/article.php3?id_article=48
English version       :
Archives                 : http://listserv.linguistlist.org/archives/ln.html
                                http://liste.cines.fr/info/ln

The LN list is sponsored by ATALA (Association pour le Traitement
Automatique des Langues)
Information and membership: http://www.atala.org/
-------------------------------------------------------------------------