[Corpora-List] news from the OPUS corpus

Joerg Tiedemann j.tiedemann at rug.nl
Tue May 20 10:30:05 UTC 2008


There are some new items in the OPUS corpus that might be of interest
to Corpora readers:


EMEA - a parallel corpus of documents provided by the
European Medicines Agency (22 languages, 17 million sentences)
http://www.let.rug.nl/tiedeman/OPUS/EMEA.php

Machine-annotated Dutch Treebanks for Europarl and OpenSubtitles
http://www.let.rug.nl/~vannoord/trees/Treebank/Machine/europarl3/COMPACT/
http://www.let.rug.nl/~vannoord/trees/Treebank/Machine/OpenSubtitles/COMPACT/

A word alignment database with data from Europarl, OpenSubtitles and 
EUconst linked with the multilingual corpus search interface: 
http://www.let.rug.nl/tiedeman/OPUS/lex.php

Furthermore, the tokenization has been improved for the Europarl corpus,
version 3 (especially for Dutch), and for the OpenSubtitles corpus (which
also has better sentence splitting). The sentence alignment for subtitles
should be better now as well.

There is a little Perl script to convert OPUS corpora to GIZA++/Moses 
format: http://www.let.rug.nl/tiedeman/OPUS/tools/opus2moses.pl
More information can be found at http://www.let.rug.nl/tiedeman/OPUS/tools
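
For readers unfamiliar with the target format: GIZA++ and Moses expect
two plain-text files, one per language, with one sentence per line, where
line i of the first file is aligned with line i of the second. The sketch
below only illustrates that layout. It is not the opus2moses.pl script
and does not read OPUS XML; the function name write_moses_pair is
hypothetical, and it assumes the aligned sentence pairs have already been
extracted as plain strings.

```python
def write_moses_pair(pairs, src_path, trg_path):
    """Write aligned sentence pairs as two line-parallel text files.

    Hypothetical helper for illustration only (not part of OPUS).
    Line i of src_path corresponds to line i of trg_path.
    Returns the number of pairs written.
    """
    written = 0
    with open(src_path, "w", encoding="utf-8") as src, \
         open(trg_path, "w", encoding="utf-8") as trg:
        for source, target in pairs:
            # Collapse all whitespace (including embedded newlines)
            # so that every segment occupies exactly one line.
            source = " ".join(source.split())
            target = " ".join(target.split())
            # Skip pairs with an empty side: dropping both halves
            # together keeps the two files in step.
            if not source or not target:
                continue
            src.write(source + "\n")
            trg.write(target + "\n")
            written += 1
    return written
```

Calling write_moses_pair(pairs, "corpus.en", "corpus.nl") with a list of
(English, Dutch) string pairs would produce two files with the same
number of lines. Skipping pairs with an empty side is a design choice in
this sketch, made so that the line numbering of the two files never
drifts apart.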



All the corpora have been annotated and aligned automatically. No manual 
corrections have been carried out.

Any feedback is very welcome. I'm also interested in hearing about
projects and people using OPUS data and tools. Feel free to send me
comments about the quality of the data, especially if you have evaluated
the sentence alignment for certain subcorpora.


-- 
Jörg


***********/\/\/\/\/\/\/\/\/\/\/\************************************
**  Jörg Tiedemann                 j.tiedemann at rug.nl              **
**  Alfa-Informatica               http://www.let.rug.nl/~tiedeman **
**  Rijksuniversiteit Groningen    Harmoniegebouw, room 1311-429   **
**  Postbus 716                    phone: +31 (0)50-363 5935       **
**  9700 AS Groningen              fax:   +31 (0)50-363 6855       **
*************************************/\/\/\/\/\/\/\/\/\/\/\**********
