[Corpora-List] starting a machine translation project

Philipp Koehn pkoehn at inf.ed.ac.uk
Wed Sep 13 10:43:06 UTC 2006


Hi Nano,

> We want to start an English-Indonesian MT project. We found that
> there is an opensource MT toolkit, "Moses", in http://www.statmt.org/

> I don't know much about machine translation. From some articles I've
> been reading, it looks like Statistical translation method is a rather
> easy but yet produce a reasonable result.
>
> I got some newbie-like questions:
> - Our main purpose is to make an opensource English-to-Indonesian MT,
> can we use Moses for this purpose, or perhaps Moses is specific for
> Foreign-to-English translation only?
While the documentation is written with the assumption of foreign-English
translation, you may use Moses for any language direction. We have built
many MT systems with target languages other than English.

> - AFAIK, we have to provide bilingual corpus to do the statistical
> training. Some articles mentioned about "phrase translation". Do we
> need to provide some kind of phrase table, or perhaps it is generated
> automatically by a special program?
The phrase table is built during training, using the tools provided
by Moses. All you need is a parallel corpus, sentence-aligned, and
with words separated by spaces.
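
As a rough illustration of that input format, here is a minimal Python
sketch that checks two lists of lines for sentence alignment and
normalizes the spacing between words. It is not part of Moses; the
function names are made up for this example, and it assumes line N of
the English side translates line N of the Indonesian side.

```python
from typing import Iterable, List, Tuple


def normalize(line: str) -> str:
    """Collapse runs of whitespace so words are separated by single spaces."""
    return " ".join(line.split())


def pair_sentences(
    source_lines: Iterable[str], target_lines: Iterable[str]
) -> List[Tuple[str, str]]:
    """Pair up a sentence-aligned corpus (hypothetical helper, not a Moses tool).

    Line N of the source is assumed to be the translation of line N of the
    target. Pairs where either side is empty are dropped.
    """
    src = [normalize(line) for line in source_lines]
    tgt = [normalize(line) for line in target_lines]
    if len(src) != len(tgt):
        # Differing line counts mean the two sides are not sentence-aligned.
        raise ValueError(
            f"line counts differ: {len(src)} source vs {len(tgt)} target"
        )
    return [(s, t) for s, t in zip(src, tgt) if s and t]
```

Reading each side with open(...) and passing the two file objects to
pair_sentences would be one way to sanity-check a corpus before training.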

> - If we can't use Moses, do you have some guidance for us, perhaps
> like some pointers to opensource toolkit?
There are a few other open source decoders, such as Phramer, and training
systems, such as Thot, and closed source decoders, such as Pharaoh.
I'd recommend Moses, but then I am of course biased.

> - As a rough prediction, how many months is it going take to develop
> an "early-version" of English-to-ForeignLanguage MT ?
Given the parallel corpus, about a day :)
In practice, there will be many issues in preparing the data
in the appropriate form. There may be spelling and font issues
with Indonesian, you may not have the data in the required
sentence-aligned format, and getting familiar with the tools may
take a while...

Regards,
Philipp Koehn


