[Corpora-List] Using corpora in SMT

Mon Sep 21 11:27:32 UTC 2009

Apologies for being a bit off topic, but several years ago I built a toy
statistical machine translation system. I used a hand-crafted
Estonian-English corpus to generate the translation model, the BNC as
the language model, and GIZA, the CMU toolkit, Perl and the ISI decoder
to actually implement the system.

I added a small amount of morphological processing, extracting case
information from the Estonian texts, which greatly improved
performance.

It was good fun and very interesting, but as it was some time ago I
wonder what would be available if I were to repeat the exercise half a
decade later.

The computing power I have has increased a lot, especially in the area
of storage, and I could get a much bigger parallel corpus now.

What is new to play with?

Regards, Paul

Paul Johnston
Humanities ICT (Infrastructure)
Samuel Alexander Building
Room W1.19
e-mail Paul.Johnston at manchester.ac.uk
web http://web-1.humanities.manchester.ac.uk/prjs/mcasspj/

Are the traps made from grenades, high explosives, or other things?

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora