[Corpora-List] OPUS-corpus news

Joerg Tiedemann tiedeman at let.rug.nl
Tue Oct 30 15:10:51 UTC 2007


There is a new parallel corpus in OPUS: OpenSub - a collection of 
movie subtitles in various languages. For details look at the OPUS 
homepage (new location!): http://urd.let.rug.nl/tiedeman/OPUS/

The subtitle corpus is aligned and freely available. In return I would 
like to get some help with the following issues:

- evaluation: it's hard for me to judge the quality of tokenization 
  and sentence alignment for all language pairs 
- tools for automatic correction: most subtitles are scanned from 
  DVDs and therefore often contain OCR errors; I would like to 
  use available tools to correct these errors in a batch run 
  (preferably something that I can apply to different 
   languages or that I can easily train on available data)
- I would also like to get other kinds of feedback about the data and 
  to hear from people who are actually using the corpus

Thanks in advance!


Jörg


***********/\/\/\/\/\/\/\/\/\/\/\************************************
**  Jörg Tiedemann                 tiedeman at let.rug.nl             **
**  Alfa-Informatica               http://www.let.rug.nl/~tiedeman **  
**  Rijksuniversiteit Groningen     Harmoniegebouw, room 1311-429  **
**  Oude Kijk in 't Jatstraat 26    phone: +31 (0)50-363 5935      **
**  9712 EK Groningen               fax:   +31 (0)50-363 6855      **
*************************************/\/\/\/\/\/\/\/\/\/\/\**********



