[Corpora-List] Released: Version 5 of European Parliament Proceedings Parallel Corpus
Philipp Koehn
pkoehn at inf.ed.ac.uk
Wed Jan 20 17:02:35 UTC 2010
European Parliament Proceedings Parallel Corpus 1996-2009
On 20 January 2010 we released a further expanded and improved
version of the corpus. The corpus is available in two forms: a source
release containing the document files and a sentence aligner, and
sentence-aligned parallel corpora for language pairs that include English.
URL: http://www.statmt.org/europarl/
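
For readers who want to work with one of the parallel corpora, here is a minimal sketch of how a sentence-aligned language pair might be read in Python. It assumes the pair is stored as two plain-text files with one sentence per line, where line N of one file corresponds to line N of the other. That layout, the UTF-8 encoding, the function name read_parallel and the file names in the usage note are assumptions made for illustration, not details stated in this announcement; check the files you actually download.

```python
from typing import Iterator, Tuple


def read_parallel(src_path: str, tgt_path: str,
                  encoding: str = "utf-8") -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned files.

    Assumption: line N of src_path is the translation of line N of tgt_path.
    Pairs where either side is empty after stripping are skipped.
    """
    with open(src_path, encoding=encoding) as src, \
         open(tgt_path, encoding=encoding) as tgt:
        # zip stops at the shorter file; a length mismatch would indicate
        # that the two files are not actually aligned.
        for src_line, tgt_line in zip(src, tgt):
            src_sent, tgt_sent = src_line.strip(), tgt_line.strip()
            if src_sent and tgt_sent:
                yield src_sent, tgt_sent
```

A call such as read_parallel("corpus.fr", "corpus.en") (hypothetical file names) would then let you iterate over French-English sentence pairs without loading either file fully into memory.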
The Europarl parallel corpus is extracted from the proceedings of the
European Parliament. It includes versions in 11 European languages:
Romance (French, Italian, Spanish, Portuguese), Germanic (English,
Dutch, German, Danish, Swedish), Greek and Finnish.
Changes since v3 (v4 was only partially released, for WMT 2009):
* added 11/2007 - 10/2009 data
* now up to 55 million words per language
* further refined pre-processing, cleaning
The work was in part supported by the EuroMatrixPlus project funded by
the European Commission (7th Framework Programme).
For questions, please contact:
Philipp Koehn
University of Edinburgh