[Corpora-List] starting a machine translation project

Gilles Serasset Gilles.Serasset at imag.fr
Thu Sep 14 12:20:02 UTC 2006

Dear Nano,

I find it very refreshing to see such enthusiasm from a soon-to-be
researcher in MT.

However, I hope you have A LOT of money to lose...

It seems that you missed one of the most crucial points in the answers.

IF YOU HAVE THE CORPUS then statistical MT is the way to go using  

This assertion is logically correct, because it is an implication  
with a false hypothesis.

From what I have heard from researchers who have actually built MT
systems using SMT techniques, a 50 MILLION word aligned corpus was
the minimum for acceptable SMT results IN A SPECIFIC DOMAIN. The most
advanced researchers now say that 200 MILLION words are required.

As a comparison, the EUROPARL corpus (one of the best-known aligned
corpora with French and English) contains 28 MILLION words. So it's
about half the lower bound... It contains the European Parliament
discussions, translated into European languages. Even such verbose
people do not produce more than 1-2 million words per year...
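
To make the gap concrete, here is a minimal sketch of the arithmetic,
using only the figures quoted above. The function names are
illustrative, and the growth rate is simply the upper end of the
1-2 million words per year estimate, so treat the results as rough
orders of magnitude rather than measurements.

```python
# Figures quoted in this message, all in words of aligned text.
EUROPARL_WORDS = 28_000_000      # size of EUROPARL as quoted above
SMT_MINIMUM_WORDS = 50_000_000   # earlier minimum for one domain
SMT_CURRENT_WORDS = 200_000_000  # requirement quoted more recently
WORDS_PER_YEAR = 2_000_000       # assumed: upper end of 1-2 million


def coverage(corpus_words, required_words):
    """Fraction of the required corpus size that is already available."""
    return corpus_words / required_words


def years_to_collect(corpus_words, required_words, words_per_year):
    """Years needed to fill the gap at a constant production rate."""
    missing = max(0, required_words - corpus_words)
    return missing / words_per_year


if __name__ == "__main__":
    for label, target in [("50M", SMT_MINIMUM_WORDS),
                          ("200M", SMT_CURRENT_WORDS)]:
        share = coverage(EUROPARL_WORDS, target)
        years = years_to_collect(EUROPARL_WORDS, target, WORDS_PER_YEAR)
        print(f"Target {label}: {share:.0%} covered, "
              f"about {years:.0f} more years at the assumed rate")
```

Under these assumptions EUROPARL covers about 56% of the lower bound
and 14% of the higher one, and closing the gap at parliamentary speed
would take roughly 11 and 86 years respectively.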

Of course, you can easily build a toy system, but it will remain a
toy for a long time. And I can assure you that the Bible, even with
all its books, is very far from these figures (31102 verses, as far
as I know...).

The suggestion made on this list to develop a system using a
transfer-based approach is by far the better way to go for a
language pair where the corpus is not yet available, as building the
corpus will require even more resources than building the
analysis/transfer/generation rules.

Finally, as you are focussing on the Indonesian language, you should
contact Prof. Tang Enya Kong's team in Penang (UTMK, Universiti
Sains Malaysia), which already has a pretty good English-Malay
mock-up based on EBMT (Example-Based Machine Translation) methods.

I hope this helps you get results in Machine Translation.


Gilles Sérasset,

On 14 sept. 06, at 12:16, Nano Surbakti wrote:

> Dear Friends,
> Thanks for all explanation and guidance, they're really helping us.
> We'll go on study and choose the best one regarding our situation,
> etc. The SMT method looks more promising now, but we haven't decide to
> choose it.
> We're trying to find sponsor, to hire experts and accelerate the
> project. The idea just appear some weeks ago, and we have very limited
> human resources in this field (only me have background in IT, and we
> have no language expert).
> Thanks again,
> --
> Nano Surbakti

Gilles Sérasset
BP 53 - F-38041 Grenoble Cedex 9
Phone: +33 4 76 51 43 80
Fax:   +33 4 76 44 66 75
