[Corpora-List] starting a machine translation project
Gilles Serasset
Gilles.Serasset at imag.fr
Thu Sep 14 12:20:02 UTC 2006
Dear Nano,
I find it very refreshing to see such enthusiasm from a soon-to-be
researcher in MT.
However, I hope you have A LOT of money to lose...
It seems that you missed one of the most crucial points in the answers.
IF YOU HAVE THE CORPUS then statistical MT is the way to go using
blahblahblah...
This assertion is logically correct, because it is an implication
with a false hypothesis.
As far as I have heard from researchers who have actually built MT
systems using SMT techniques, a 50 MILLION word aligned corpus was the
minimum for acceptable SMT results IN A SPECIFIC DOMAIN. The most
advanced researchers now say that 200 MILLION words are required.
As a comparison, the EUROPARL corpus (one of the best-known aligned
corpora with French and English) contains 28 MILLION words. So it's
about half the lower bound... It contains the European Parliament
discussions, translated into European languages. Even such verbose
people do not produce more than 1-2 million words per year...
Of course, you can easily build a toy system, but it will remain a
toy for a long time. And I can assure you that the Bible, even with
all its books, is very far from these figures (31102 verses as far
as I know...).
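
To make the orders of magnitude concrete, here is a rough
back-of-the-envelope sketch in Python. The 50 million, 200 million,
28 million and 31102 figures are the ones quoted above; the average of
25 words per verse is an assumption added purely for illustration (it
varies by language and translation), and the names are hypothetical.

```python
# Figures quoted in this message.
SMT_MINIMUM_WORDS = 50_000_000    # older minimum for one specific domain
SMT_CURRENT_WORDS = 200_000_000   # requirement quoted more recently
EUROPARL_WORDS = 28_000_000
BIBLE_VERSES = 31_102

# ASSUMPTION (not from this message): about 25 words per verse.
ASSUMED_WORDS_PER_VERSE = 25


def fraction_of(target_words, corpus_words):
    """Return the share of target_words that corpus_words covers."""
    return corpus_words / target_words


# Estimated Bible size under the assumption above.
bible_words = BIBLE_VERSES * ASSUMED_WORDS_PER_VERSE

for name, size in [("EUROPARL", EUROPARL_WORDS),
                   ("Bible (estimated)", bible_words)]:
    print(f"{name}: {size:,} words, "
          f"{fraction_of(SMT_MINIMUM_WORDS, size):.1%} of the 50M minimum, "
          f"{fraction_of(SMT_CURRENT_WORDS, size):.1%} of the 200M figure")
```

Under that assumption the Bible comes to well under one million words,
a small fraction of even the lower bound.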
The suggestion that has been made on this list to develop a system
using a transfer-based approach is by far the better way to go for a
language pair where the corpus is not yet available, as building the
corpus will require even more resources than building the
analysis/transfer/generation rules.
Finally, as you are focussing on the Indonesian language, you should
contact Prof. Tang Enya Kong's team in Penang (UTMK, Universiti Sains
Malaysia), who already have a pretty good English-Malay mock-up based
on EBMT (Example-Based Machine Translation) methods.
Hope this will help you get results in Machine Translation.
Regards,
Gilles Sérasset,
On 14 sept. 06, at 12:16, Nano Surbakti wrote:
> Dear Friends,
>
> Thanks for all the explanations and guidance; they're really helping us.
> We'll go on studying and choose the best one for our situation,
> etc. The SMT method looks more promising now, but we haven't decided to
> choose it.
>
> We're trying to find a sponsor, to hire experts and accelerate the
> project. The idea just appeared some weeks ago, and we have very limited
> human resources in this field (only I have a background in IT, and we
> have no language expert).
>
> Thanks again,
>
> --
> Nano Surbakti
>
--
Gilles Sérasset
GETA-CLIPS-IMAG (UJF, INPG & CNRS)
BP 53 - F-38041 Grenoble Cedex 9
Phone: +33 4 76 51 43 80
Fax: +33 4 76 44 66 75