[Corpora-List] [Moses-support] filter parallel corpus

Amin Farajian ma.farajian at gmail.com
Thu Jan 16 15:58:11 UTC 2014


Dear Saeed,

You can do the data selection with IRSTLM; I think it fits your needs. Take
a look at the following page:
http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=Data_selection

It helps you find the subset of sentences in your large training corpus
that best matches your test corpus.
Note that it was originally designed for the monolingual scenario, but if
you want to filter a parallel corpus, you can do the following:

1. Add line numbers to the beginning of each line of the source side of
your training corpus.
2. Do the data selection as described in the manual.
3. Extract the corresponding translations of the selected source lines.
4. Enjoy life
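
Steps 1 and 3 above can be sketched in Python as follows. This is only a
minimal sketch under some assumptions: the line number is separated from the
sentence by a tab, the selection tool in step 2 passes that prefix through
unchanged, and both sides of the corpus have one sentence per line. The
function names (number_lines, selected_line_numbers, extract_pairs) are
hypothetical helpers, not part of IRSTLM or Moses, and the data selection
itself is not reproduced here.

```python
def number_lines(lines):
    """Step 1: prefix each source line with its 1-based line number."""
    # Assumption: a tab separates the number from the sentence.
    return [f"{i}\t{line}" for i, line in enumerate(lines, start=1)]


def selected_line_numbers(selected_lines):
    """Read the leading line number back from each selected line."""
    numbers = []
    for line in selected_lines:
        fields = line.split(None, 1)  # split on the first run of whitespace
        if fields and fields[0].isdigit():
            numbers.append(int(fields[0]))
    return numbers


def extract_pairs(source_lines, target_lines, numbers):
    """Step 3: pick the source/target pairs at the selected line numbers."""
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target must have the same number of lines")
    return [(source_lines[n - 1], target_lines[n - 1]) for n in numbers]


if __name__ == "__main__":
    source = ["the cat sleeps", "stock prices fell", "the dog barks"]
    target = ["le chat dort", "les cours ont baisse", "le chien aboie"]
    numbered = number_lines(source)
    # Stand-in for step 2: pretend the selection tool kept lines 3 and 1.
    selected = [numbered[2], numbered[0]]
    numbers = selected_line_numbers(selected)
    for src, tgt in extract_pairs(source, target, numbers):
        print(src, "|||", tgt)
```

If the selection tool tokenizes or reorders its output, the tab-separated
prefix may need a different separator or parsing rule; check the actual
output format before relying on this.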

Best,
Amin



On Thu, Jan 16, 2014 at 4:43 PM, Saeed Farzi <saeedfarzi at gmail.com> wrote:

> Dear all,
>
> I am working on a translation task with a very large parallel corpus.
> Because of the computational cost of training on such a corpus, I am
> going to filter it with respect to the test set (of course, after
> filtering, the evaluation must still be fair).
>
> I am looking for a solution or a tool for filtering parallel corpus
> sentences.
>
> Note that I do not need to filter the phrase table. I know that the
> filter_ moses tool reduces the phrase table size.
>
> cheers
> --
>            S.Farzi, Ph.D. Student
>     Natural Language Processing Lab,
>   School of Electrical and Computer Eng.,
>                Tehran University
>              Tel: +9821-6111-9719
> _______________________________________________
> Moses-support mailing list
> Moses-support at mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>