[Corpora-List] ECDC-TM - A freely available translation memory in 25 languages

Ralf Steinberger ralf.steinberger at jrc.ec.europa.eu
Fri Oct 19 10:03:01 UTC 2012


ECDC-TM is a translation memory (sentences and their manually produced
translations) in 25 languages. It is a multilingual parallel corpus covering
300 language pairs.
 
Size:       Up to 2500 translation units per language; 32,000 in total.
 
Languages:  All 300 language pairs involving the following 25 languages: 
 
            Bulgarian, Czech, Danish, Dutch, English, Estonian, German,
            Greek, Finnish, French, Hungarian, Icelandic, Italian, Latvian,
            Lithuanian, Maltese, Norwegian, Polish, Portuguese, Romanian,
            Slovak, Slovene, Spanish, Swedish and Turkish.
            
URL:        http://langtech.jrc.ec.europa.eu/ECDC-TM.html
Creator:    European Centre for Disease Prevention and Control (ECDC)
            <http://www.ecdc.europa.eu/> and JRC
 
 
WHAT IS ECDC-TM
 
ECDC-TM was produced by professionally translating the English language web
pages of the European Centre for Disease Prevention and Control (ECDC), an
EU agency based in Stockholm. The results of the translation were stored in
24 bilingual translation memories. The JRC post-processed these by cleaning
the data and by producing one alignment for all 25 languages, resulting in
parallel data for 300 language pairs.
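The figure of 300 follows from simple counting: 25 languages give
25 * 24 / 2 = 300 unordered language pairs. A minimal Python sketch of that
arithmetic is shown here. The function name language_pairs is made up for
illustration and is not part of the ECDC-TM distribution.

```python
from itertools import combinations

# The 25 languages listed in the announcement.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian", "German",
    "Greek", "Finnish", "French", "Hungarian", "Icelandic", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Norwegian", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish", "Turkish",
]


def language_pairs(languages):
    """Return every unordered pair of distinct languages (hypothetical helper)."""
    return list(combinations(languages, 2))


pairs = language_pairs(LANGUAGES)
print(len(LANGUAGES), len(pairs))  # 25 300
```

Each pair is counted once, so English-French and French-English are the same
pair. Counting both directions separately would give 600.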
 
Most of the documents cover health-related topics (anthrax, botulism,
cholera, dengue fever, hepatitis, etc.), but some of the web pages also
describe ECDC itself (e.g. its organisation, job opportunities) and its
activities (e.g. epidemic intelligence, surveillance).
 
The ECDC Translation Memory <http://langtech.jrc.ec.europa.eu/ECDC-TM.html>
is much smaller than the other multilingual resources distributed in the
past by the European Commission's Joint Research Centre (JRC). Its main
advantages are that (a) it covers even more languages and (b) it is based on
texts from a very different domain (Public Health).
 
 
MOTIVATION FOR THIS RELEASE
 
The public data release is in line with the general effort of the European
Commission to support multilingualism, language diversity and the re-use of
Commission information. It follows the release of the JRC-Acquis
<http://langtech.jrc.ec.europa.eu/JRC-Acquis.html>  parallel corpus in 2006
(over 1 billion words in 22 languages), of the DGT-TM Translation Memory
<http://langtech.jrc.ec.europa.eu/DGT-TM.html>  in 2007 and 2011, the
multilingual named entity resource JRC-Names
<http://langtech.jrc.ec.europa.eu/JRC-Names.html>  in 2011, the multi-label
classification software JRC EuroVoc Indexer JEX
<http://langtech.jrc.ec.europa.eu/Eurovoc.html>  in 22 languages and further
smaller multilingual resources. See
http://langtech.jrc.ec.europa.eu/JRC_Resources.html for more information on
these resources.
 
 
WHAT ECDC-TM CAN BE USED FOR
                
ECDC-TM can be fed into translation memory software to support human
translators in their work. As it is a parallel corpus in electronic form, it
can also be used by specialists in computational linguistics to train
statistical machine translation software, to generate multilingual
dictionaries, to train and test multilingual information extraction
software, and more.
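As a rough illustration of the corpus use case, the Python sketch below pulls
the sentence pairs for one language pair out of a multilingual translation
memory. It rests on assumptions not stated in this announcement: that the
data is stored as TMX (tu/tuv/seg elements with an xml:lang attribute) and
that language codes look like "EN" or "FR". The function name extract_pairs
and the sample sentences are invented for the example and are not taken from
ECDC-TM.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny invented sample in TMX-like form (NOT real ECDC-TM content).
SAMPLE_TMX = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="EN"><seg>Cholera is an infectious disease.</seg></tuv>
      <tuv xml:lang="FR"><seg>Le choléra est une maladie infectieuse.</seg></tuv>
      <tuv xml:lang="DE"><seg>Cholera ist eine Infektionskrankheit.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="EN"><seg>Surveillance data are published weekly.</seg></tuv>
      <tuv xml:lang="DE"><seg>Überwachungsdaten werden wöchentlich veröffentlicht.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def extract_pairs(tmx_text, src, tgt):
    """Return (source, target) segment pairs for one language pair.

    Translation units that lack either language are skipped.
    """
    src, tgt = src.upper(), tgt.upper()
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            # Older TMX files use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").upper()
            seg = tuv.find("seg")
            if lang and seg is not None:
                segments[lang] = "".join(seg.itertext()).strip()
        if src in segments and tgt in segments:
            pairs.append((segments[src], segments[tgt]))
    return pairs


en_fr = extract_pairs(SAMPLE_TMX, "EN", "FR")
en_de = extract_pairs(SAMPLE_TMX, "EN", "DE")
print(len(en_fr), len(en_de))  # 1 2
```

Because all languages share one alignment, the same loop serves any of the
300 pairs: only the two language codes change.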
 
 
WHAT NEXT?
 
The JRC and collaborating services of the European Commission plan to
release further large-scale linguistic resources in the near future. These
include another 25-language translation memory and a paragraph-aligned
full-text parallel corpus in 23 languages.
 
 
Ralf Steinberger & Mohamed Ebrahim
European Commission - Joint Research Centre (JRC)
21027 Ispra (VA), Italy
URL - Applications:  http://emm.newsbrief.eu/overview.html
URL - The science behind them:  http://langtech.jrc.ec.europa.eu/
 
 