[Corpora-List] starting a machine translation project

Joerg Tiedemann tiedeman at let.rug.nl
Wed Sep 13 12:08:15 UTC 2006



> Based on your experience, is there a minimum number of words or sentences
> in a corpus to produce a basic translation service? If the purpose is
> for daily language use, is it enough to use an English-Indonesian
> Bible as a corpus?


you could include translated KDE messages from the OPUS corpus to have at 
least some more up-to-date data (http://omilia.uio.no/opus/kde.html)

download
http://omilia.uio.no/opus/KDE/id.tar.gz
http://omilia.uio.no/opus/KDE/en.tar.gz
and the sentence alignments in
http://omilia.uio.no/opus/KDE/enid.ces.gz

(all other languages are aligned to Indonesian as well ... just download 
the corresponding files)
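
to read the alignment file after downloading, something like the rough python sketch below might do. it assumes the .ces file follows the usual XCES alignment layout (linkGrp elements with fromDoc/toDoc attributes, containing link elements whose xtargets attribute looks like "1 2;3"); please check this against the actual file. the function names and the SAMPLE data are made up for illustration only.

```python
import gzip
import xml.etree.ElementTree as ET

# Made-up sample in the assumed XCES alignment layout (not real OPUS data).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<cesAlign version="1.0">
  <linkGrp targType="s" fromDoc="en/a.xml" toDoc="id/a.xml">
    <link xtargets="1;1" />
    <link xtargets="2 3;2" />
    <link xtargets="4;" />
  </linkGrp>
</cesAlign>
"""


def parse_xtargets(value):
    """Split an xtargets value such as '1 2;3' into (['1', '2'], ['3'])."""
    src, _, tgt = value.partition(";")
    return src.split(), tgt.split()


def read_links(xml_bytes):
    """Yield (from_doc, to_doc, src_ids, tgt_ids) for every link element."""
    root = ET.fromstring(xml_bytes)
    for grp in root.iter("linkGrp"):
        from_doc = grp.get("fromDoc")
        to_doc = grp.get("toDoc")
        for link in grp.iter("link"):
            src_ids, tgt_ids = parse_xtargets(link.get("xtargets", ""))
            yield from_doc, to_doc, src_ids, tgt_ids


def read_links_from_gz(path):
    """Read a gzip-compressed alignment file, e.g. a local copy of enid.ces.gz."""
    with gzip.open(path, "rb") as fh:
        return list(read_links(fh.read()))


if __name__ == "__main__":
    for from_doc, to_doc, src_ids, tgt_ids in read_links(SAMPLE):
        print(from_doc, src_ids, "<->", to_doc, tgt_ids)
```

links with an empty side (like "4;" in the sample) have no counterpart in the other language, and many-to-one links may be noisy, so you would probably want to filter both before SMT training.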

the KDE text is of course not very exciting and maybe not exactly what you 
might need for the SMT training (it's mainly terms and not so many 
complete sentences). but you could try.
(it's very small as well but at least you have many language pairs)

good luck!


Jörg

***********/\/\/\/\/\/\/\/\/\/\/\************************************
**  Jörg Tiedemann                 tiedeman at let.rug.nl             **
**  Alfa-Informatica               http://www.let.rug.nl/~tiedeman **
**  Rijksuniversiteit Groningen     Harmoniegebouw, room 1311-429  **
**  Oude Kijk in 't Jatstraat 26    phone: +31 (0)50-363 5935      **
**  9712 EK Groningen               fax:   +31 (0)50-363 6855      **
*************************************/\/\/\/\/\/\/\/\/\/\/\**********

