[Corpora-List] [Moses-support] filter parallel corpus
Amin Farajian
ma.farajian at gmail.com
Thu Jan 16 15:58:11 UTC 2014
Dear Saeed,
You can do the data selection using IRSTLM. I think it fits your needs. Take
a look at the following link:
http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=Data_selection
It helps you find the subset of sentences in your large training
corpus that best matches your test corpus.
Note that it was originally designed for the monolingual scenario. But if
you want to filter a parallel corpus, you can do the following:
1. Add line numbers to the beginning of each line of the source side of
your training corpus.
2. Do the data selection as described in the manual.
3. Use the line numbers of the selected source lines to extract the
corresponding translations from the target side.
4. Enjoy life
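
Steps 1 and 3 above could be sketched in Python roughly as follows. This
is only a minimal illustration, not part of IRSTLM: the function names
are hypothetical, and it assumes that the data selection step (step 2,
treated here as a black box) returns the selected source lines with the
line-number prefix still attached as the first whitespace-separated
token.

```python
def number_lines(lines):
    """Step 1: prefix each source sentence with its 1-based line number.

    Assumption: a single space separates the number from the sentence.
    """
    return [f"{i} {line}" for i, line in enumerate(lines, start=1)]


def selected_indices(selected_lines):
    """Recover the line numbers from the selected (numbered) source lines.

    Assumption: the selection tool keeps the numeric prefix as the first
    whitespace-separated token of every line it outputs.
    """
    return [int(line.split(None, 1)[0]) for line in selected_lines]


def extract_pairs(source, target, indices):
    """Step 3: pick the source/target pairs at the selected line numbers."""
    if len(source) != len(target):
        raise ValueError("source and target must have the same number of lines")
    return [(source[i - 1], target[i - 1]) for i in indices]


# Toy in-memory corpus; a real corpus would be read from two aligned files.
source = ["das ist ein test", "guten morgen", "vielen dank"]
target = ["this is a test", "good morning", "thank you very much"]

numbered = number_lines(source)

# Stand-in for step 2: pretend the selection tool kept lines 3 and 1.
selected = [numbered[2], numbered[0]]

pairs = extract_pairs(source, target, selected_indices(selected))
```

The original (unnumbered) source list is indexed in the last step, so
the numeric prefix never ends up in the filtered training data.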
Bests,
Amin
On Thu, Jan 16, 2014 at 4:43 PM, Saeed Farzi <saeedfarzi at gmail.com> wrote:
> Dear all,
>
> I am working on a translation task with a very large parallel corpus.
> Because of the computational cost of training on such a corpus, I am
> going to filter it with respect to the test set (of course, after
> filtering, the evaluation must still be fair).
>
> I am looking for a solution or a tool for filtering parallel corpus
> sentences.
>
> Note that I do not need to filter the phrase table. I know that the
> filter_ moses tool reduces the phrase table size.
>
> cheers
> --
> S.Farzi, Ph.D. Student
> Natural Language Processing Lab,
> School of Electrical and Computer Eng.,
> Tehran University
> Tel: +9821-6111-9719