[Corpora-List] Bilingual Dictionary from Comparable Corpora

Alberto Simões albie at alfarrabio.di.uminho.pt
Sat Oct 11 08:52:53 UTC 2014


Dear Javid,

For parallel corpora, NATools can handle corpora of that size (it works 
in chunks). However, it is not designed to handle comparable corpora :(
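
The general idea of working in chunks can be sketched as follows. This is
only an illustration in Python, not how NATools is implemented: the function
names (cooccurrence_counts, write_chunk, read_chunk), the chunk size and the
tab-separated temporary file format are assumptions made for the example. It
counts source/target word co-occurrences over aligned document pairs, keeps
only one chunk of counts in memory at a time, and merges the sorted partial
counts from disk. Tokens are assumed to contain no tabs or newlines.

```python
import heapq
import itertools
import os
import tempfile
from collections import Counter


def write_chunk(counts, directory):
    """Write one chunk's pair counts to a temp file, sorted by (src, tgt)."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".tsv", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        for (src, tgt), n in sorted(counts.items()):
            out.write(f"{src}\t{tgt}\t{n}\n")
    return path


def read_chunk(path):
    """Yield ((src, tgt), count) from a sorted chunk file, line by line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            src, tgt, n = line.rstrip("\n").split("\t")
            yield (src, tgt), int(n)


def cooccurrence_counts(doc_pairs, chunk_size, directory):
    """Yield ((src, tgt), total) in sorted order.

    doc_pairs is any iterable of (source_tokens, target_tokens); only
    chunk_size document pairs are counted in memory at once.
    """
    paths = []
    pairs = iter(doc_pairs)
    while True:
        chunk = list(itertools.islice(pairs, chunk_size))
        if not chunk:
            break
        counts = Counter()
        for src_tokens, tgt_tokens in chunk:
            # Count each word pair once per aligned document pair.
            for src in set(src_tokens):
                for tgt in set(tgt_tokens):
                    counts[(src, tgt)] += 1
        paths.append(write_chunk(counts, directory))
    # Streaming k-way merge of the sorted chunk files, then sum per pair.
    merged = heapq.merge(*(read_chunk(p) for p in paths))
    for pair, group in itertools.groupby(merged, key=lambda item: item[0]):
        yield pair, sum(n for _, n in group)
```

With a generator that reads the document pairs from disk, the totals can be
consumed one pair at a time (for example, written straight to a file) rather
than collected into a full matrix. The chunk size is the knob that trades
memory for the number of temporary files.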

On 11/10/14 09:34, javid dadashkarimi wrote:
> Hi everybody,
> Thank you so much for your useful suggestions.
> However, our corpora are almost 20 GB in size and we have a memory
> problem: we have 300K unique target words and 750K alignments,
> and we cannot load the document-word or word-alignment matrices into
> memory. How can I use the tools efficiently?
> Best,
> Javid
>
> On Thu, Oct 9, 2014 at 2:34 AM, Reinhard Rapp <reinhardrapp at gmx.de> wrote:
>
>     Dear all,
>
>     I would like to point to the work done by Tomas Mikolov, Quoc V. Le,
>     and Ilya Sutskever:
>
>     http://arxiv.org/abs/1309.4168
>
>     It seems that there is code available for this (see footnote 1 of
>     the paper).
>
>     There is also a popular science article on this approach:
>
>     http://www.technologyreview.com/view/519581/how-google-converted-language-translation-into-a-problem-of-vector-space-mathematics/
>
>     Together with Michael Zock I organized a shared task on
>     multi-stimulus association at the COLING 2014 workshop on Cognitive
>     Aspects of the Lexicon (CogALex-IV) and from this I know that
>     systems using Mikolov et al.'s neural network-based language
>     modelling approach perform extremely well in the monolingual case
>     (see e.g. the first 4 papers in the workshop proceedings to be found
>     at http://aclanthology.info/events/cogalex-2014#W14-47).
>
>     Let me also mention that we (Pierre Zweigenbaum, Serge Sharoff, and
>     myself) are currently serving as guest editors for a special issue
>     of the Journal of Natural Language Engineering (JNLE) on the topic
>     of "Machine Translation Using Comparable Corpora":
>     http://comparable.limsi.fr/jnle-bucc2015/ (submissions welcome,
>     deadline Dec. 1, 2014). If you are working in this field, but will
>     not be able to submit a paper yourself, please let us know about
>     your work (especially if it is not already mentioned in the
>     introductory chapter of the volume "Building and Using Comparable
>     Corpora", see Serge's previous e-mail in this thread) as we are
>     preparing an overview article which aims to be as comprehensive as
>     possible.
>
>     Many thanks and kind regards,
>
>     Reinhard
>
>     -----Original Message-----
>     From: inguna.skadina at lumii.lv
>     Sent: Tuesday, October 7, 2014 8:48 AM
>     To: Inguna Skadiņa
>     Cc: corpora at uib.no; gate-users-request at lists.sourceforge.net
>     Subject: Re: [Corpora-List] Bilingual Dictionary from Comparable Corpora
>
>     Dear Javid,
>
>
>     The ACCURAT toolkit (http://accurat-project.eu/) can identify
>     semi-parallel sentences in comparable corpora and extract a
>     dictionary/translation table from them (with the support of GIZA++).
>
>     I hope you will find it useful.
>
>     Best wishes,
>     Inguna Skadiņa
>
>         Quoting javid dadashkarimi <javiddadashkarimi at gmail.com>:
>
>             Hi,
>             Is there any tool for extracting a probabilistic bilingual
>             dictionary from a
>             bilingual comparable corpus? Does Moses support such a task?
>             Best,
>             Javid
>
>     _______________________________________________
>     UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>     Corpora mailing list
>     Corpora at uib.no
>     http://mailman.uib.no/listinfo/corpora
>
