[Corpora-List] starting a machine translation project
Chris Callison-Burch
callison-burch at ed.ac.uk
Wed Sep 13 10:57:50 UTC 2006
Dear Nano,
I was one of the developers of the Moses open source toolkit during
the Johns Hopkins CLSP summer workshop this year. Please see
http://www.statmt.org/moses/ for more information about the project.
> I got some newbie-like questions:
> - Our main purpose is to make an opensource English-to-Indonesian MT,
> can we use Moses for this purpose, or perhaps Moses is specific for
> Foreign-to-English translation only?
Moses is not specific to one direction; you can use it to do both
English-to-Indonesian and Indonesian-to-English. All you need is an
English-Indonesian parallel corpus.
> - AFAIK, we have to provide bilingual corpus to do the statistical
> training. Some articles mentioned about "phrase translation". Do we
> need to provide some kind of phrase table, or perhaps it is generated
> automatically by a special program?
The phrase table is generated automatically, so you do not need to
provide one. The basic training procedure for statistical machine
translation systems is the following:
(1) Assemble a bilingual training corpus, and align it on the
sentence-level.
(2) Create word alignments for each sentence pair in your training
corpus using the GIZA++ implementation of the IBM Models.
(3) Extract phrase pairs and their translation probabilities from the
word-alignments.
(4) Train a language model from a large monolingual corpus for the
target language. This can be done with the SRI language modeling
toolkit.
(5) Tune the weights of the parameters of your statistical
translation model by applying your decoder to a development set and
comparing its n-best translations against a reference set.
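As a rough illustration of step (3), here is a minimal Python sketch of
how phrase pairs consistent with a word alignment can be collected and
given relative-frequency translation probabilities. It is not the Moses
extraction script: the function names, the max_len limit, and the toy
sentence pair are assumptions made for this example, and it skips the
usual extension over unaligned target words.

```python
from collections import Counter


def extract_phrase_pairs(src, tgt, alignment, max_len=3):
    """Collect phrase pairs that are consistent with a word alignment.

    src, tgt  -- lists of tokens for one sentence pair
    alignment -- set of (source_index, target_index) links
    max_len   -- hypothetical cap on phrase length, in tokens
    """
    pairs = []
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word in the source span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency check: no word inside the target span may be
            # linked to a source word outside the source span.
            if any(t_start <= t <= t_end and not (s_start <= s <= s_end)
                   for (s, t) in alignment):
                continue
            pairs.append((" ".join(src[s_start:s_end + 1]),
                          " ".join(tgt[t_start:t_end + 1])))
    return pairs


def phrase_translation_probs(pairs):
    """Estimate p(target phrase | source phrase) by relative frequency."""
    pair_counts = Counter(pairs)
    src_counts = Counter(s for s, _ in pairs)
    return {(s, t): c / src_counts[s] for (s, t), c in pair_counts.items()}


# Toy English-Indonesian pair with a reordered (crossing) alignment.
toy_pairs = extract_phrase_pairs(["red", "book"], ["buku", "merah"],
                                 {(0, 1), (1, 0)})
toy_probs = phrase_translation_probs(toy_pairs)
```

In a real system the pairs would be pooled over the whole training
corpus before the probabilities are estimated.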
Moses provides support scripts for extracting phrase pairs from word
alignments, and for tuning the weights with minimum error rate training.
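To make the tuning step (5) concrete, the sketch below scores each
n-best hypothesis as a weighted sum of feature values and picks the
weights that give the best output on a development set. This is a
brute-force stand-in, not the minimum error rate training that Moses
uses: the grid of candidate weights, the word-overlap metric, and all
function names are assumptions made for illustration.

```python
from itertools import product


def model_score(features, weights):
    """Weighted sum of feature values (e.g. log-probabilities)."""
    return sum(w * f for w, f in zip(weights, features))


def best_hypothesis(nbest, weights):
    """Return the text of the highest-scoring (text, features) hypothesis."""
    return max(nbest, key=lambda hyp: model_score(hyp[1], weights))[0]


def overlap(hypothesis, reference):
    """Toy quality metric: fraction of hypothesis words in the reference."""
    hyp_words = hypothesis.split()
    ref_words = set(reference.split())
    return sum(w in ref_words for w in hyp_words) / len(hyp_words)


def tune_weights(dev_nbest, references, grid=(0.0, 0.5, 1.0)):
    """Try every weight combination on the grid and keep the best one."""
    n_features = len(dev_nbest[0][0][1])
    best_weights, best_quality = None, -1.0
    for weights in product(grid, repeat=n_features):
        quality = sum(overlap(best_hypothesis(nbest, weights), ref)
                      for nbest, ref in zip(dev_nbest, references))
        if quality > best_quality:
            best_weights, best_quality = weights, quality
    return best_weights


# One development sentence with two hypotheses; the two feature values
# stand for translation-model and language-model log-probabilities.
dev_nbest = [[("buku biru", (-1.0, -3.0)), ("buku merah", (-2.0, -1.0))]]
references = ["buku merah"]
tuned = tune_weights(dev_nbest, references)
```

An exhaustive grid only works for a handful of features; it is shown
here because it makes the objective easy to see.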
Moses provides additional facilities beyond standard phrase-based
models, in that it allows additional layers of representation to be
integrated in the translation process. For instance, rather than
represent phrases as sequences of words, they can now be represented
as sequences of 'factors' such as words, part-of-speech, stems,
morphological info, etc.
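A small sketch of what a factored token might look like in practice,
assuming each token carries its factors joined by "|" (the layout
commonly used for factored Moses corpora). The choice and order of
factors below (surface form, part-of-speech, stem) and the function name
are assumptions made for this example.

```python
def parse_factored_sentence(line, factor_names=("surface", "pos", "stem")):
    """Split a factored sentence into one dict of factors per token.

    Each whitespace-separated token is expected to look like
    "books|NNS|book", with one value per name in factor_names.
    """
    tokens = []
    for token in line.split():
        values = token.split("|")
        if len(values) != len(factor_names):
            raise ValueError(f"expected {len(factor_names)} factors in {token!r}")
        tokens.append(dict(zip(factor_names, values)))
    return tokens


# Example: an English phrase annotated with part-of-speech and stem.
factored = parse_factored_sentence("red|JJ|red books|NNS|book")
```

A translation model can then condition on any of these factors, for
example translating stems while generating surface forms separately.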
In order to take advantage of these capabilities, you must have
additional tools that allow you to tag your parallel corpus with the
factors that you want to use. Factors can be used on the source
side, or the target side, or both, so you might be able to take
advantage of existing tools for English if you do not have any such
tools for Indonesian.
> - If we can't use Moses, do you have some guidance for us, perhaps
> like some pointers to opensource toolkit?
If you are able to assemble a parallel corpus, then I would encourage
you to try Moses. It is a straightforward way of developing a
translation system quickly, and allows a range of possible extensions
and improvements using the factor-based representations.
> - As a rough prediction, how many months is it going take to develop
> an "early-version" of English-to-ForeignLanguage MT ?
If you have a parallel corpus in place, then it should take just a
few weeks to familiarize yourself with Moses and to read the
relevant background literature. At that point you will be able to
produce a baseline system, analyze the output, and think of many
possible ways of improving it.
Good luck!
Chris Callison-Burch