[Corpora-List] OPUS-corpus news
Joerg Tiedemann
tiedeman at let.rug.nl
Tue Oct 30 15:10:51 UTC 2007
There is a new parallel corpus in OPUS: OpenSub - a collection of
movie subtitles in various languages. For details, see the OPUS
homepage (new location!): http://urd.let.rug.nl/tiedeman/OPUS/
The subtitle corpus is aligned and freely available. In return I would
like to get some help with the following issues:
- evaluation: it's hard for me to judge the quality of tokenization
and sentence alignment for all language pairs
- tools for automatic correction: most subtitles are scanned from
DVDs and therefore often contain OCR errors; I would like to
use available tools to correct these errors in a batch run
(preferably something that I can apply to different
languages or that I can easily train on available data)
- I would also like to get other kinds of feedback about the data and
would like to hear from people actually using it
Thanks in advance!
Jörg
***********/\/\/\/\/\/\/\/\/\/\/\************************************
** Jörg Tiedemann tiedeman at let.rug.nl **
** Alfa-Informatica http://www.let.rug.nl/~tiedeman **
** Rijksuniversiteit Groningen Harmoniegebouw, room 1311-429 **
** Oude Kijk in 't Jatstraat 26 phone: +31 (0)50-363 5935 **
** 9712 EK Groningen fax: +31 (0)50-363 6855 **
*************************************/\/\/\/\/\/\/\/\/\/\/\**********
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora