[Corpora-List] ECDC-TM - A freely available translation memory in 25 languages - CORRECTION

Ralf Steinberger ralf.steinberger at jrc.ec.europa.eu
Mon Oct 22 09:11:41 UTC 2012


This is a correction to the announcement sent on Friday 19 October. The
email and the related ECDC-TM webpage contained incorrect information about
the languages covered and about the corpus statistics. Thanks to
Raivis Skadiņš for pointing this out. I had mixed up information from two
different corpora. 25 languages seems to be more than my little brain can
handle. ;-) Please accept my apologies.
 
You will find the corrected summary below. The web page has also been corrected.
 
Ralf
 
==========
 
ECDC-TM is a translation memory (sentences and their manually produced
translations) in 25 languages. It is a multilingual parallel corpus covering
300 language pairs.
 
Size:       Up to 3900 translation units per language; 64,000 in total.
 
Languages:  All 300 language pairs involving the following 25 languages: 
 
            Bulgarian, Czech, Danish, Dutch, English, Estonian, German,
            Greek, Finnish, French, Irish, Hungarian, Icelandic, Italian,
            Latvian, Lithuanian, Maltese, Norwegian, Polish, Portuguese,
            Romanian, Slovak, Slovene, Spanish and Swedish.
            
URL:        http://langtech.jrc.ec.europa.eu/ECDC-TM.html
Creator:    European Centre for Disease Prevention and Control (ECDC,
            <http://www.ecdc.europa.eu/>) and JRC
 
 
WHAT IS ECDC-TM
 
ECDC-TM was produced by professionally translating the English language web
pages of the European Centre for Disease Prevention and Control (ECDC), an
EU agency based in Stockholm. The results of the translation were stored in
24 bilingual translation memories. The JRC post-processed these by cleaning
the data and by producing one alignment for all 25 languages, resulting in
parallel data for 300 language pairs.
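 
The figure of 300 follows from simple counting: 25 languages yield
25 * 24 / 2 = 300 unordered pairs, of which 24 involve English (the 24
bilingual memories above). A minimal Python sketch that checks this against
the language list in this announcement (the variable names are only
illustrative):

```python
from itertools import combinations

# The 25 languages listed in this announcement.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Irish", "Hungarian",
    "Icelandic", "Italian", "Latvian", "Lithuanian", "Maltese",
    "Norwegian", "Polish", "Portuguese", "Romanian", "Slovak",
    "Slovene", "Spanish", "Swedish",
]

# Every unordered pair of distinct languages.
pairs = list(combinations(LANGUAGES, 2))

# Pairs that include English, i.e. the original bilingual memories.
english_pairs = [p for p in pairs if "English" in p]

print(len(LANGUAGES), len(pairs), len(english_pairs))  # prints: 25 300 24
```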
 
Most of the documents cover health-related topics (anthrax, botulism,
cholera, dengue fever, hepatitis, etc.), but some of the web pages also
describe ECDC itself (e.g. its organisation, job opportunities) and its
activities (e.g. epidemic intelligence, surveillance).
 
The ECDC Translation Memory <http://langtech.jrc.ec.europa.eu/ECDC-TM.html>
is much smaller than the other multilingual resources distributed in the
past by the European Commission’s Joint Research Centre (JRC). Its main
advantages are that (a) it covers even more languages and (b) it is based on
texts from a very different domain (Public Health).
 
 
MOTIVATION FOR THIS RELEASE
 
The public data release is in line with the general effort of the European
Commission to support multilingualism, language diversity and the re-use of
Commission information. It follows the release of the JRC-Acquis
<http://langtech.jrc.ec.europa.eu/JRC-Acquis.html> parallel corpus in 2006
(over 1 billion words in 22 languages), the DGT-TM Translation Memory
<http://langtech.jrc.ec.europa.eu/DGT-TM.html> in 2007 and 2011, the
multilingual named entity resource JRC-Names
<http://langtech.jrc.ec.europa.eu/JRC-Names.html> in 2011, the multi-label
classification software JRC EuroVoc Indexer JEX
<http://langtech.jrc.ec.europa.eu/Eurovoc.html> in 22 languages and further
smaller multilingual resources. See
http://langtech.jrc.ec.europa.eu/JRC_Resources.html for more information on
these resources.
 
 
WHAT ECDC-TM CAN BE USED FOR
                
ECDC-TM can be fed into translation memory software to support human
translators in their work. As it is a parallel corpus in electronic form,
it can furthermore be used by specialists in computational linguistics
to train statistical machine translation software, to generate multilingual
dictionaries, to train and test multilingual information extraction
software, and more.
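 
As an illustration of such use, the sketch below pulls the sentence pairs
for one language pair out of a translation memory. It assumes the data is
in the standard TMX exchange format (tu / tuv / seg elements with an
xml:lang attribute); this announcement does not state the distribution
format, so please check the ECDC-TM web page. The function name, the
language codes and the sample sentences are made up for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_pair(root, src, tgt):
    """Return (source, target) segment pairs for one language pair.

    `root` is the root element of a TMX document (assumed format).
    Translation units lacking either language are skipped.
    """
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute instead.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segments[lang] = "".join(seg.itertext()).strip()
        if src in segments and tgt in segments:
            pairs.append((segments[src], segments[tgt]))
    return pairs


# Tiny invented sample, not actual ECDC-TM content.
SAMPLE = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="EN"><seg>Cholera is an acute infection.</seg></tuv>
      <tuv xml:lang="DE"><seg>Cholera ist eine akute Infektion.</seg></tuv>
      <tuv xml:lang="FR"><seg>Le choléra est une infection aiguë.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="EN"><seg>Job opportunities</seg></tuv>
      <tuv xml:lang="FR"><seg>Offres d'emploi</seg></tuv>
    </tu>
  </body>
</tmx>"""

sample_root = ET.fromstring(SAMPLE)
en_de = extract_pair(sample_root, "en", "de")
en_fr = extract_pair(sample_root, "en", "fr")
print(len(en_de), len(en_fr))
```

For a file on disk, the root element would come from
ET.parse(path).getroot(), where path is wherever the downloaded file was
saved. Because all languages share one alignment, the same function can
serve any of the 300 pairs, not just those involving English.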
 
 
WHAT NEXT?
 
The JRC and collaborating services of the European Commission plan to
release further large-scale linguistic resources in the near future. These
include another 25-language translation memory and a paragraph-aligned
full-text parallel corpus in 23 languages.
 
 
Ralf Steinberger & Mohamed Ebrahim
European Commission - Joint Research Centre (JRC)
21027 Ispra (VA), Italy
URL – Applications: http://emm.newsbrief.eu/overview.html
URL – The science behind them: http://langtech.jrc.ec.europa.eu/
 