Soft: New major release of the continuous space LM toolkit for SMT

Wed Jun 6 07:43:50 UTC 2012

Date: Mon, 04 Jun 2012 01:14:04 +0200
From: Holger Schwenk <holger.schwenk at lium.univ-lemans.fr>
Message-ID: <4FCBEFBC.2010109 at lium.univ-lemans.fr>
X-url: http://www-lium.univ-lemans.fr/cslm/

I'm happy to announce the availability of a new version of the
continuous space language model (CSLM) toolkit.

Continuous space methods were first introduced by Yoshua Bengio in 2001
[1].  The basic idea of this approach is to project the word indices
onto a continuous space and to use a probability estimator operating on
this space.  Since the resulting probability functions are smooth
functions of the word representation, better generalization to unknown
events can be expected.  A neural network can be used to simultaneously
learn the projection of the words onto the continuous space and to
estimate the n-gram probabilities.  This is still an n-gram approach, but
the LM probabilities are interpolated for any possible context of length
n-1 instead of backing off to shorter contexts.
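
To make the idea concrete, here is a minimal sketch of the forward pass
of such a network in plain Python.  It is an illustration only, not code
from the toolkit: the class name TinyCSLM, the layer sizes, the tanh
hidden layer and the random untrained weights are all assumptions, and
bias terms and training are left out for brevity.

```python
import math
import random

def softmax(scores):
    """Turn arbitrary real scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class TinyCSLM:
    """Toy feed-forward n-gram LM: projection -> tanh hidden layer -> softmax.

    Hypothetical illustration; names and sizes are not taken from the toolkit.
    """

    def __init__(self, vocab_size, order=3, dim=4, hidden=8, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.order = order
        self.proj = rand(vocab_size, dim)             # one continuous vector per word
        self.w_hid = rand(hidden, (order - 1) * dim)  # concatenated context -> hidden
        self.w_out = rand(vocab_size, hidden)         # hidden -> one score per word

    def probs(self, context):
        """Return P(w | context) for every word index w in the vocabulary."""
        assert len(context) == self.order - 1
        # Look up and concatenate the projections of the n-1 context words.
        x = [v for idx in context for v in self.proj[idx]]
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in self.w_hid]
        scores = [sum(w * hi for w, hi in zip(row, h)) for row in self.w_out]
        return softmax(scores)

model = TinyCSLM(vocab_size=10)
p = model.probs([3, 7])  # distribution over the word following context (3, 7)
```

Because every context of length n-1 maps to some point in the continuous
space, the network returns a full distribution even for contexts it has
never seen, which is what replaces backing off in this sketch.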

CSLMs were initially used in large-vocabulary speech recognition systems
and more recently in statistical machine translation.  Relative
perplexity improvements of 10 to 20% have been reported for many
languages and tasks.
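
For readers unfamiliar with the measure, the following small sketch
shows how perplexity and a relative improvement are computed.  The
function names and the probabilities are made up for illustration; they
are not results from any experiment.

```python
import math

def perplexity(log_probs):
    """Perplexity from natural-log probabilities, one per predicted word."""
    return math.exp(-sum(log_probs) / len(log_probs))

def relative_reduction(baseline_ppl, new_ppl):
    """Perplexity reduction expressed as a fraction of the baseline."""
    return (baseline_ppl - new_ppl) / baseline_ppl

# Made-up per-word probabilities, for illustration only.
backoff_ppl = perplexity([math.log(0.01)] * 4)    # about 100
cslm_ppl = perplexity([math.log(0.0125)] * 4)     # about 80
gain = relative_reduction(backoff_ppl, cslm_ppl)  # about 0.2, i.e. 20% relative
```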

This version of the CSLM toolkit is a major update of the first
release. The new features include:

  - full support for short-lists during training and inference. This
    allows the CSLM to be applied to tasks with large vocabularies.
  - very efficient n-best list rescoring.
  - support for Nvidia graphics cards (GPUs). This speeds up training
    by a factor of four with respect to a high-end server with two
    CPUs.
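
As an illustration of the short-list idea, the sketch below shows one
common way to combine a network that predicts only the most frequent
words with a back-off LM that covers the full vocabulary, along the
lines of the scheme described in [2].  The function name, the toy
vocabulary and all probabilities are assumptions made for this example;
the toolkit's actual implementation may differ.

```python
def shortlist_prob(word, nn_probs, backoff_probs, shortlist):
    """P(word | h), combining a short-list network with a back-off LM.

    nn_probs:      network distribution over short-list words only (sums to 1)
    backoff_probs: back-off LM distribution over the full vocabulary (sums to 1)
    """
    if word not in shortlist:
        # Words outside the short-list are scored by the back-off LM alone.
        return backoff_probs[word]
    # Rescale the network output by the probability mass that the
    # back-off LM assigns to the short-list in this context.
    mass = sum(backoff_probs[w] for w in shortlist)
    return nn_probs[word] * mass

# Toy numbers, for illustration only.
shortlist = {"the", "of"}
backoff = {"the": 0.5, "of": 0.3, "zygote": 0.2}
nn = {"the": 0.75, "of": 0.25}
combined = {w: shortlist_prob(w, nn, backoff, shortlist) for w in backoff}
```

With this rescaling the combined values still sum to one over the full
vocabulary, while the output layer of the network only has to cover the
short-list.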

We successfully trained CSLMs on large tasks like NIST OpenMT'12.
Training on one billion words takes less than 24 hours. In our
experiments, the CSLM achieves improvements in the BLEU score of up to
two points with respect to a large unpruned back-off LM.

A detailed description of the approach can be found in the following
publications:

[1] Yoshua Bengio and Rejean Ducharme. A neural probabilistic language
    model. In NIPS, volume 13, pages 932-938, 2001.

[2] Holger Schwenk. Continuous space language models. Computer Speech
    and Language, volume 21, pages 492-518, 2007.

[3] Holger Schwenk. Continuous space language models for statistical
    machine translation. The Prague Bulletin of Mathematical
    Linguistics, number 83, pages 137-146, 2010.

[4] Holger Schwenk, Anthony Rousseau and Mohammed Attik. Large, pruned
    or continuous space language models on a GPU for statistical machine
    translation. In NAACL Workshop on the Future of Language Modeling,
    June 2012.

The software is available at http://www-lium.univ-lemans.fr/cslm/. It is
distributed under GPL v3.

Comments, bug reports, requests for extensions and contributions are
welcome.

Enjoy,

Holger Schwenk

LIUM
University of Le Mans
Holger.Schwenk at lium.univ-lemans.fr
