[Corpora-List] Released: Version 5 of European Parliament Proceedings Parallel Corpus

Philipp Koehn pkoehn at inf.ed.ac.uk
Wed Jan 20 17:02:35 UTC 2010


European Parliament Proceedings Parallel Corpus 1996-2009

On 20 January 2010 we released version 5 of the corpus, further
expanded and improved. It is available as a source release, which
contains the document files and a sentence aligner, and as parallel
corpora for language pairs that include English.

URL: http://www.statmt.org/europarl/
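
As a rough illustration of how a sentence-aligned parallel corpus for one
language pair might be consumed, here is a minimal Python sketch. It
assumes (this announcement does not say so) that a pair is stored as two
UTF-8 plain-text files with one sentence per line, where line N of one
file is the translation of line N of the other. The function name and the
file paths are hypothetical, not part of the release.

```python
from itertools import zip_longest
from typing import Iterator, Tuple


def read_sentence_pairs(src_path: str, tgt_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned files.

    Assumption: both files are UTF-8 and aligned line by line.
    """
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        pairs = zip_longest(src, tgt)
        for line_no, (s, t) in enumerate(pairs, start=1):
            # A missing line on either side means the files are not aligned.
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            s, t = s.strip(), t.strip()
            # Skip pairs where either side is empty.
            if s and t:
                yield s, t
```

A caller would pass the two files of one language pair, for example
read_sentence_pairs("pair.de", "pair.en") with hypothetical file names,
and iterate over the resulting pairs.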

The Europarl parallel corpus is extracted from the proceedings of the
European Parliament. It includes versions in 11 European languages:
Romance (French, Italian, Spanish, Portuguese), Germanic (English,
Dutch, German, Danish, Swedish), Greek and Finnish.

Changes since v3 (v4 was only partially released, for WMT 2009):

    * added 11/2007 - 10/2009 data
    * now up to 55 million words per language
    * further refined pre-processing, cleaning

The work was in part supported by the EuroMatrixPlus project funded by
the European Commission (7th Framework Programme).

For questions, please contact:
Philipp Koehn
University of Edinburgh

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


