[Corpora-List] news from the OPUS corpus
Joerg Tiedemann
j.tiedemann at rug.nl
Tue May 20 10:30:05 UTC 2008
There are some new items in the OPUS corpus that might be of interest
to Corpora readers:
EMEA - a parallel corpus of documents provided by the
European Medicines Agency (22 languages, 17 million sentences)
http://www.let.rug.nl/tiedeman/OPUS/EMEA.php
Machine-annotated Dutch Treebanks for Europarl and OpenSubtitles
http://www.let.rug.nl/~vannoord/trees/Treebank/Machine/europarl3/COMPACT/
http://www.let.rug.nl/~vannoord/trees/Treebank/Machine/OpenSubtitles/COMPACT/
A word alignment database with data from Europarl, OpenSubtitles and
EUconst linked with the multilingual corpus search interface:
http://www.let.rug.nl/tiedeman/OPUS/lex.php
Furthermore, the tokenization has been improved for version 3 of the
Europarl corpus (especially for Dutch) and for the OpenSubtitles corpus,
which now also has better sentence splitting. The alignment for
subtitles should be better now as well.
There is a little Perl script to convert OPUS corpora to GIZA++/Moses
format: http://www.let.rug.nl/tiedeman/OPUS/tools/opus2moses.pl
More information can be found at http://www.let.rug.nl/tiedeman/OPUS/tools
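For readers unfamiliar with the target format: Moses training data is a
pair of plain-text files with one sentence per line, where line i of the
source file corresponds to line i of the target file. The Python sketch
below illustrates only that output layout. It is not the opus2moses.pl
script and makes no claim about how that script works; the function name
write_moses_pair and the sample sentences are made up for illustration.

```python
import io


def write_moses_pair(pairs, src_out, trg_out):
    """Write aligned (source, target) sentence pairs to two line-aligned streams.

    Hypothetical helper for illustration only; not part of the OPUS tools.
    Returns the number of pairs written.
    """
    written = 0
    for src, trg in pairs:
        # Collapse internal whitespace and newlines so that each sentence
        # occupies exactly one line in its output stream.
        src_line = " ".join(src.split())
        trg_line = " ".join(trg.split())
        # Skip pairs with an empty side so both streams stay line-aligned.
        if not src_line or not trg_line:
            continue
        src_out.write(src_line + "\n")
        trg_out.write(trg_line + "\n")
        written += 1
    return written


# Small in-memory demonstration with invented sentences.
demo_src, demo_trg = io.StringIO(), io.StringIO()
write_moses_pair(
    [("Dit is een test .", "This is a test .")],
    demo_src,
    demo_trg,
)
```

In practice the two streams would be files sharing a base name and
differing in a language suffix, which is the layout Moses expects for
training input.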
All the corpora have been annotated and aligned automatically. No manual
corrections have been carried out.
Any feedback is very welcome. I'm also interested in hearing about
projects and people using OPUS data and tools. Feel free to send me
comments about the quality of the data, especially if you have evaluated
the sentence alignment for particular subcorpora.
--
Jörg
***********/\/\/\/\/\/\/\/\/\/\/\************************************
** Jörg Tiedemann j.tiedemann at rug.nl **
** Alfa-Informatica http://www.let.rug.nl/~tiedeman **
** Rijksuniversiteit Groningen Harmoniegebouw, room 1311-429 **
** Postbus 716 phone: +31 (0)50-363 5935 **
** 9700 AS Groningen fax: +31 (0)50-363 6855 **
*************************************/\/\/\/\/\/\/\/\/\/\/\**********
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora