[Corpora-List] Using corpora in SMT

Mon Sep 21 11:41:21 UTC 2009

2009/9/21 Paul Johnston <paul.a.johnston at manchester.ac.uk>:
> Apologies for being a bit off-topic, but several years ago I built a toy
> Statistical Machine Translation system using a hand-crafted Estonian-English
> corpus to generate the translation model, the BNC for the language model, and
> GIZA, the CMU toolkit, Perl and the ISI decoder to actually implement the
> system.
>
> I added a small level of morphological processing which greatly increased
> the performance by extracting case information from the Estonian texts.
>
> It was good fun and very interesting but as it was some time ago I wonder
> what is available if I were to repeat the exercise half a decade later.
>
> I have a lot more computing power now, especially storage, and I could
> get a much bigger parallel corpus.
>
> What is new to play with?

Moses (http://statmt.org/moses/) is a widely used open-source SMT
decoder. For language modelling it can use either IRSTLM or RandLM,
both of which have advantages over the CMU toolkit. GIZA++ is still
the standard tool for word alignment, but there are versions out there
that can take advantage of multiple processors/threads
(http://www.cs.cmu.edu/~qing/). There's also the Berkeley Aligner
(http://code.google.com/p/berkeleyaligner/), which can use
source-language parse trees to get better alignments.
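
To make concrete what the word alignment step computes, here is a
minimal Python sketch of IBM Model 1, the simplest of the IBM models
that GIZA++ trains. It is not GIZA++'s implementation: it leaves out
the NULL word and the higher models, and the function names
(train_ibm1, align) and the three-sentence Estonian-English toy corpus
are made up purely for illustration.

```python
from collections import defaultdict

# Hypothetical toy corpus (not from any real data set):
# (Estonian tokens, English tokens) per sentence pair.
TOY_PAIRS = [
    (["see", "maja"], ["this", "house"]),
    (["see", "raamat"], ["this", "book"]),
    (["uus", "raamat"], ["new", "book"]),
]


def train_ibm1(pairs, iterations=10):
    """Estimate t[(f, e)] = P(source word f | target word e) with EM."""
    src_vocab = {f for src, _ in pairs for f in src}
    # Start from a uniform distribution over the source vocabulary.
    t = defaultdict(lambda: 1.0 / len(src_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of e being linked
        # E-step: spread each source word over the target words of its
        # sentence, in proportion to the current probabilities.
        for src, tgt in pairs:
            for f in src:
                z = sum(t[(f, e)] for e in tgt)
                for e in tgt:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise the expected counts per target word.
        t = defaultdict(
            float, {(f, e): count[(f, e)] / total[e] for (f, e) in count}
        )
    return t


def align(src, tgt, t):
    """Link each source word to its most probable target word."""
    return [(f, max(tgt, key=lambda e: t[(f, e)])) for f in src]


if __name__ == "__main__":
    table = train_ibm1(TOY_PAIRS)
    for src, tgt in TOY_PAIRS:
        print(align(src, tgt, table))
```

Even on three sentences the co-occurrence statistics are enough to
separate "see" from "maja" and "raamat"; with a real corpus you would
let GIZA++ or the Berkeley Aligner do this rather than a script like
the above.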

The JRC Acquis corpus includes English-Estonian, among others
(http://wt.jrc.it/lt/Acquis/).

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora