[Corpora-List] starting a machine translation project

Wed Sep 13 10:57:50 UTC 2006

Dear Nano,

I was one of the developers of the Moses open source toolkit during
the Johns Hopkins CLSP summer workshop this year.  Please see
http://www.statmt.org/moses/ for more information about the project.

> I got some newbie-like questions:
> - Our main purpose is to make an opensource English-to-Indonesian MT,
> can we use Moses for this purpose, or perhaps Moses is specific for
> Foreign-to-English translation only?

Moses is not specific to one direction; you can use it to do both  
English-to-Indonesian and Indonesian-to-English.  All you need is an  
English-Indonesian parallel corpus.

> - AFAIK, we have to provide bilingual corpus to do the statistical
> training. Some articles mentioned about "phrase translation". Do we
> need to provide some kind of phrase table, or perhaps it is generated
> automatically by a special program?

The basic training procedure for statistical machine translation
systems is the following:
(1) Assemble a bilingual training corpus, and align it on the  
sentence-level.
(2) Create word alignments for each sentence pair in your training
corpus using the GIZA++ implementation of the IBM Models.
(3) Extract phrase pairs and their translation probabilities from the  
word-alignments.
(4) Train a language model from a large monolingual corpus for the  
target language.  This can be done with the SRI language modeling  
toolkit.
(5) Tune the feature weights of your statistical translation model
by applying your decoder to a development set and comparing its
n-best translations against a set of reference translations.
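
To make steps (2) and (3) more concrete, here is a minimal Python
sketch of how phrase pairs can be read off a word-aligned sentence
pair and scored by relative frequency.  It is only an illustration,
not the Moses extraction script: the function names, the max_len
limit, and the toy English-Indonesian sentence are assumptions of
mine, and for brevity it does not extend phrases over unaligned
boundary words the way a full extractor would.

```python
from collections import Counter

def extract_phrase_pairs(src, tgt, alignment, max_len=4):
    """Collect phrase pairs consistent with a word alignment.

    src, tgt  -- lists of tokens
    alignment -- set of (src_index, tgt_index) links, e.g. from a word aligner
    A pair is consistent when no word inside it is linked to a word outside it.
    """
    pairs = []
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to the chosen source span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Every link touching the target span must start inside the source span.
            consistent = all(
                s_start <= s <= s_end
                for (s, t) in alignment
                if t_start <= t <= t_end
            )
            if consistent:
                pairs.append((
                    " ".join(src[s_start:s_end + 1]),
                    " ".join(tgt[t_start:t_end + 1]),
                ))
    return pairs

def phrase_probabilities(pairs):
    """Relative-frequency estimate of p(target phrase | source phrase)."""
    pair_counts = Counter(pairs)
    src_counts = Counter(s for s, _ in pairs)
    return {(s, t): c / src_counts[s] for (s, t), c in pair_counts.items()}

# Toy example (hypothetical data): a monotone one-to-one alignment.
src = "I eat rice".split()
tgt = "saya makan nasi".split()
alignment = {(0, 0), (1, 1), (2, 2)}
pairs = extract_phrase_pairs(src, tgt, alignment)
probs = phrase_probabilities(pairs)
```

In a real system the pairs from every sentence in the corpus would
be pooled before the probabilities are estimated.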

Moses provides support scripts for extracting phrase-pairs from word  
alignments, and for tuning the weights with minimum error rate training.
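
As a rough illustration of what tuning the weights means, the Python
sketch below scores each n-best hypothesis as a weighted sum of
feature values and searches for the weights whose top-ranked
hypotheses best match the references.  Please note the assumptions:
it uses a plain grid search and a toy word-overlap metric in place of
minimum error rate training and BLEU, and the feature values, names,
and sentences are invented for the example.

```python
import itertools

def model_score(features, weights):
    """Log-linear model score: weighted sum of feature values."""
    return sum(w * f for w, f in zip(weights, features))

def pick_best(nbest, weights):
    """Return the hypothesis text with the highest model score."""
    return max(nbest, key=lambda hyp: model_score(hyp[1], weights))[0]

def word_overlap(hypothesis, reference):
    """Toy quality metric: fraction of hypothesis words found in the reference."""
    hyp_words = hypothesis.split()
    ref_words = set(reference.split())
    if not hyp_words:
        return 0.0
    return sum(w in ref_words for w in hyp_words) / len(hyp_words)

def tune(dev_nbest, references, grid=(0.0, 0.5, 1.0)):
    """Grid search for the weights that maximise average quality on the dev set."""
    n_features = len(dev_nbest[0][0][1])
    best_weights, best_quality = None, -1.0
    for weights in itertools.product(grid, repeat=n_features):
        total = sum(
            word_overlap(pick_best(nbest, weights), ref)
            for nbest, ref in zip(dev_nbest, references)
        )
        quality = total / len(references)
        if quality > best_quality:
            best_weights, best_quality = weights, quality
    return best_weights, best_quality

# Hypothetical dev set: one sentence with a 2-best list.
# Each hypothesis carries (translation model log-prob, language model log-prob).
dev_nbest = [[
    ("saya makan beras", (-1.0, -3.0)),
    ("saya makan nasi", (-2.0, -1.0)),
]]
references = ["saya makan nasi"]
weights, quality = tune(dev_nbest, references)
```

The real procedure works the same way in outline: decode the
development set, collect n-best lists, adjust the weights, and repeat
until the weights stop changing.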

Moses provides additional facilities beyond standard phrase-based  
models, in that it allows additional layers of representation to be  
integrated in the translation process.  For instance, rather than
representing phrases as sequences of words, Moses can represent them
as sequences of 'factors' such as words, parts of speech, stems,
morphological information, etc.
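
A small Python sketch may help to picture a factored representation.
It assumes each token is written as surface|POS|stem with "|" as the
separator; the three-factor layout, the Token type, and the helper
names are my own choices for illustration, so check the Moses
documentation for the exact input format you need.

```python
from typing import List, NamedTuple

class Token(NamedTuple):
    """One word annotated with several factors (hypothetical layout)."""
    surface: str
    pos: str
    stem: str

def parse_factored(sentence: str, sep: str = "|") -> List[Token]:
    """Split a factored sentence into tokens with named factors."""
    tokens = []
    for item in sentence.split():
        parts = item.split(sep)
        if len(parts) != 3:
            raise ValueError(f"expected 3 factors in {item!r}, got {len(parts)}")
        tokens.append(Token(*parts))
    return tokens

def project(tokens: List[Token], factor: str) -> List[str]:
    """View the sentence through a single factor, e.g. only the stems."""
    return [getattr(token, factor) for token in tokens]

# Hypothetical tagged English sentence.
tokens = parse_factored("houses|NNS|house are|VBP|be big|JJ|big")
stems = project(tokens, "stem")
tags = project(tokens, "pos")
```

Seen this way, a phrase is no longer just a string of surface words;
the same span can also be matched or generated through its stems or
tags.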

In order to take advantage of these capabilities, you must have  
additional tools that allow you to tag your parallel corpus with the  
factors that you want to use.   Factors can be used on the source  
side, or the target side, or both, so you might be able to take  
advantage of existing tools for English if you do not have any such  
tools for Indonesian.

> - If we can't use Moses, do you have some guidance for us, perhaps
> like some pointers to opensource toolkit?

If you are able to assemble a parallel corpus, then I would encourage  
you to try Moses.  It is a straightforward way of developing a  
translation system quickly, and allows a range of possible extensions  
and improvements using the factor-based representations.

> - As a rough prediction, how many months is it going take to develop
> an "early-version" of English-to-ForeignLanguage MT ?

If you have a parallel corpus in place, then it should take just a
few weeks to familiarize yourself with Moses and to read the
relevant background literature.  At that point you will be able to
produce a baseline system, analyze the output, and think of many  
possible ways of improving it.

Good luck!

Chris Callison-Burch