[Corpora-List] EAC-TM - Another freely available translation memory, in 26 languages
Ralf Steinberger
ralf.steinberger at jrc.ec.europa.eu
Wed Feb 6 00:25:47 UTC 2013
EAC-TM is a translation memory (sentences and their manually produced translations) in 26 languages. It is a multilingual parallel corpus covering 325 language pairs.
Size: Up to 5100 translation units per language; 78,000 in total.
Languages: All 325 language pairs involving the following 26 languages:
Bulgarian, Czech, Danish, Dutch, English, Estonian, German,
Greek, Finnish, French, Croatian, Hungarian, Icelandic, Italian,
Latvian, Lithuanian, Maltese, Norwegian, Polish, Portuguese,
Romanian, Slovak, Slovene, Spanish, Swedish and Turkish.
URL: http://langtech.jrc.ec.europa.eu/EAC-TM.html
Creator: EC Directorate for Education and Culture (EAC) <http://ec.europa.eu/dgs/education_culture/> and JRC
WHAT IS EAC-TM
EAC-TM was produced by translating the English-language form data for the Lifelong Learning Programme (LLP) and the Youth in Action Programme of the European Commission’s Directorate General for Education and Culture (EAC). The resulting translations were stored in 25 bilingual translation memories. DG EAC and the JRC post-processed these by cleaning the data and producing a single alignment across all 26 languages, which yields parallel data for 325 language pairs.
The underlying documents are thus form data in the field of education and culture.
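To illustrate how 25 bilingual memories can yield data for all 325 pairs, here is a minimal Python sketch. It is not the procedure DG EAC and the JRC actually used. It assumes each bilingual memory maps an English source segment to its translation and merges the memories on the shared English side. The function name and the toy segments are hypothetical.

```python
from itertools import combinations

def merge_via_english(bilingual_tms):
    """Merge bilingual memories {lang: {english_segment: translation}}
    into multilingual units keyed by the shared English segment."""
    units = {}
    for lang, tm in bilingual_tms.items():
        for english, translation in tm.items():
            unit = units.setdefault(english, {"en": english})
            unit[lang] = translation
    return list(units.values())

# Toy data, invented for illustration (not taken from EAC-TM).
bilingual_tms = {
    "de": {"Family name": "Familienname", "Date of birth": "Geburtsdatum"},
    "fr": {"Family name": "Nom de famille"},
}
units = merge_via_english(bilingual_tms)

# Every unit provides data for each pair of languages it contains,
# including pairs such as de-fr that were never translated directly.
pairs = sorted({p for u in units for p in combinations(sorted(u), 2)})
print(pairs)

# With 26 languages, the number of unordered pairs is 26 * 25 / 2.
print(len(list(combinations(range(26), 2))))  # 325
```

Because a unit only covers the languages for which a translation of that English segment exists, the number of units differs from one language pair to another.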
The EAC Translation Memory <http://langtech.jrc.ec.europa.eu/EAC-TM.html> is much smaller than the other multilingual resources distributed in the past by the European Commission’s Joint Research Centre (JRC). Its main advantages are that (a) it covers even more languages and (b) it is based on texts from a very different domain (education and culture).
MOTIVATION FOR THIS RELEASE
The public data release is in line with the general effort of the European Commission to support multilingualism, language diversity and the re-use of Commission information. It follows the release of the JRC-Acquis <http://langtech.jrc.ec.europa.eu/JRC-Acquis.html> parallel corpus in 2006 (over 1 billion words in 22 languages), of the DGT-TM Translation Memory <http://langtech.jrc.ec.europa.eu/DGT-TM.html> in 2007 and 2011, the multilingual named entity resource JRC-Names <http://langtech.jrc.ec.europa.eu/JRC-Names.html> in 2011, the multi-label classification software JRC EuroVoc Indexer JEX <http://langtech.jrc.ec.europa.eu/Eurovoc.html> in 22 languages in 2012, the ECDC-TM Translation Memory <http://ipsc.jrc.ec.europa.eu/?id=782> in 25 languages in 2012, the DGT-Acquis <http://ipsc.jrc.ec.europa.eu/?id=783> parallel corpus in 23 languages in 2012, and further smaller multilingual resources. See http://ipsc.jrc.ec.europa.eu/?id=61 for more information on these resources.
WHAT EAC-TM CAN BE USED FOR
EAC-TM can be fed into translation memory software to support human translators in their work. As it is a parallel corpus in electronic form, it can furthermore be used by specialists in computational linguistics to train statistical machine translation software, to generate multilingual dictionaries, to train and test multilingual information extraction software, and more.
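For the machine translation use case, many statistical MT toolkits expect two line-aligned plain-text files, one per language. The following Python sketch shows one way to produce them for a single language pair. It assumes the translation units have already been loaded as dictionaries keyed by language code; the distribution format of EAC-TM is not described here. The function names and sample units are hypothetical.

```python
def extract_pair(units, src, tgt):
    """Return (source, target) segments for units that contain both languages."""
    return [(u[src], u[tgt]) for u in units if src in u and tgt in u]

def write_parallel(pairs, src_path, tgt_path):
    """Write one segment per line, so that line N of each file
    is a translation of line N of the other."""
    with open(src_path, "w", encoding="utf-8") as fs, \
         open(tgt_path, "w", encoding="utf-8") as ft:
        for s, t in pairs:
            # Collapse internal line breaks so the files stay line-aligned.
            fs.write(" ".join(s.split()) + "\n")
            ft.write(" ".join(t.split()) + "\n")

# Sample units, invented for illustration (not taken from EAC-TM).
units = [
    {"en": "Family name", "de": "Familienname", "fr": "Nom de famille"},
    {"en": "Date of birth", "de": "Geburtsdatum"},
]
de_fr = extract_pair(units, "de", "fr")
```

Calling `write_parallel(de_fr, "train.de", "train.fr")` would then write the two files; the file names are placeholders.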
WHAT NEXT?
The JRC and collaborating services of the European Commission hope to release further large-scale linguistic resources in the future.
<http://langtech.jrc.ec.europa.eu/RS.html> Ralf Steinberger & Mohamed Ebrahim
European Commission - Joint Research Centre (JRC)
21027 Ispra (VA), Italy
URL – Applications: http://emm.newsbrief.eu/overview.html
URL – Publications on the science behind them: http://langtech.jrc.ec.europa.eu/JRC_Publications.html