Soft: New major release of the continuous space LM toolkit for SMT
Thierry Hamon
thierry.hamon at UNIV-PARIS13.FR
Wed Jun 6 07:43:50 UTC 2012
Date: Mon, 04 Jun 2012 01:14:04 +0200
From: Holger Schwenk <holger.schwenk at lium.univ-lemans.fr>
Message-ID: <4FCBEFBC.2010109 at lium.univ-lemans.fr>
X-url: http://www-lium.univ-lemans.fr/cslm/
I'm happy to announce the availability of a new version of the
continuous space language model (CSLM) toolkit.
Continuous space methods were first introduced by Yoshua Bengio in 2001
[1]. The basic idea of this approach is to project the word indices
onto a continuous space and to use a probability estimator operating on
this space. Since the resulting probability functions are smooth
functions of the word representation, better generalization to unknown
events can be expected. A neural network can be used to simultaneously
learn the projection of the words onto the continuous space and to
estimate the n-gram probabilities. This is still an n-gram approach, but
the LM probabilities are interpolated for any possible context of length
n-1 instead of backing off to shorter contexts.
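To make this concrete, here is a minimal sketch of the forward pass of
such a feed-forward n-gram network, written in Python/NumPy. All sizes
and names are illustrative assumptions and do not reflect the toolkit's
internals:

    import numpy as np

    # Illustrative sizes: vocabulary, embedding dim, hidden units, n-gram order.
    V, d, H, n = 10000, 128, 256, 4

    rng = np.random.default_rng(0)
    P  = rng.normal(scale=0.01, size=(V, d))            # shared projection matrix
    W1 = rng.normal(scale=0.01, size=((n - 1) * d, H))  # hidden layer weights
    b1 = np.zeros(H)
    W2 = rng.normal(scale=0.01, size=(H, V))            # output layer weights
    b2 = np.zeros(V)

    def cslm_probs(context):
        """Return P(w | context) for all words w, given n-1 word indices."""
        x = P[context].reshape(-1)   # project the context words onto the continuous space
        h = np.tanh(x @ W1 + b1)     # hidden layer
        z = h @ W2 + b2
        z -= z.max()                 # for numerical stability
        e = np.exp(z)
        return e / e.sum()           # softmax over the whole vocabulary

    probs = cslm_probs([12, 7, 345])  # three context words for a 4-gram model

Training adjusts P, W1 and W2 jointly by backpropagation, so the word
projections and the probability estimator are learned simultaneously,
as described above.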
CSLMs were initially used in large-vocabulary speech recognition systems
and more recently in statistical machine translation. Relative
improvements in perplexity of 10 to 20% have been reported for many
languages and tasks.
This version of the CSLM toolkit is a major update of the first
release. The new features include:
- full support for short-lists during training and inference, which
  allows the CSLM to be applied to tasks with large vocabularies (a
  sketch follows this list).
- very efficient n-best list rescoring (illustrated further below).
- support for Nvidia graphics cards (GPUs). This speeds up training by
  a factor of four with respect to a high-end server with two CPUs.
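A minimal sketch of the short-list idea, following the scheme described
in [2]: the network's softmax covers only the most frequent words, a
conventional back-off LM covers the rest, and the network's output is
rescaled by the probability mass the back-off LM assigns to the
short-list. The interfaces below are hypothetical:

    def shortlist_prob(w, context, nn_probs, backoff_prob, shortlist):
        # nn_probs: word -> neural probability, a softmax over the
        # short-list only; backoff_prob(w, context): any conventional
        # back-off n-gram LM. Both interfaces are hypothetical.
        # Rescale the neural distribution to the probability mass the
        # back-off LM gives to short-list words in this context.
        mass = sum(backoff_prob(s, context) for s in shortlist)
        if w in shortlist:
            return nn_probs[w] * mass
        return backoff_prob(w, context)  # word outside the short-list

Since the neural probabilities sum to one over the short-list, the
combined model stays properly normalized while the expensive softmax is
computed only over the short-list.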
We successfully trained CSLMs on large tasks like NIST OpenMT'12.
Training on one billion words takes less than 24 hours. In our
experiments, the CSLM achieves improvements in the BLEU score of up to
two points with respect to a large unpruned back-off LM.
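The n-best rescoring mentioned in the feature list can be sketched as
follows: recompute the LM score of every translation hypothesis with
the CSLM, combine it with the other decoder features, and re-rank. The
data layout, interpolation weight and feature weight below are
assumptions for illustration, not the toolkit's interface:

    def rescore_nbest(nbest, cslm_logprob, lm_weight, lam=0.5):
        # nbest: list of (hypothesis, score_without_lm, backoff_lm_logprob).
        rescored = []
        for hyp, score_wo_lm, backoff_lp in nbest:
            # Interpolate the CSLM and back-off LM log-probabilities.
            lm = lam * cslm_logprob(hyp) + (1.0 - lam) * backoff_lp
            rescored.append((score_wo_lm + lm_weight * lm, hyp))
        rescored.sort(key=lambda t: t[0], reverse=True)  # best first
        return [hyp for _, hyp in rescored]

Only the comparatively small n-best list needs CSLM evaluations, which
is what makes rescoring efficient compared to integrating the network
into the decoder's search.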
A detailed description of the approach can be found in the following
publications:
[1] Yoshua Bengio and Rejean Ducharme. A neural probabilistic language
    model. In NIPS, volume 13, pages 932-938, 2001.
[2] Holger Schwenk. Continuous space language models. Computer Speech
    and Language, volume 21, pages 492-518, 2007.
[3] Holger Schwenk. Continuous space language models for statistical
    machine translation. The Prague Bulletin of Mathematical
    Linguistics, number 83, pages 137-146, 2010.
[4] Holger Schwenk, Anthony Rousseau and Mohammed Attik. Large, pruned
    or continuous space language models on a GPU for statistical
    machine translation. In NAACL Workshop on the Future of Language
    Modeling, June 2012.
The software is available at http://www-lium.univ-lemans.fr/cslm/. It is
distributed under GPL v3.
Comments, bug reports, requests for extensions and contributions are
welcome.
enjoy,
Holger Schwenk
LIUM
University of Le Mans
Holger.Schwenk at lium.univ-lemans.fr