Call: Revue TAL: Automated Learning of Language Models

Alexis Nasr alexis.nasr at LINGUIST.JUSSIEU.FR
Tue Sep 3 07:32:43 UTC 2002

                Automated Learning of Language Models

                              Deadline for submission :

                                     October 7, 2002

          Issue coordinated by Michèle Jardino (CNRS, LIMSI)
                 and Marc El-Beze (LIA, University of Avignon).

Language Models (LM) play a crucial role in Automated Natural Language
Processing systems that deal with real-life (and often very large)
problems, such as Speech Recognition, Machine Translation and
Information Retrieval. If we want these systems to adapt to new
applications, or to follow the evolution of user behaviour, we need to
automate the learning of the parameters of the models we use.
Adaptation should occur in advance or in real time. Some applications
do not allow us to build an adequate corpus, from either a
quantitative or a qualitative point of view. The richness of Web
resources makes gathering learning data easier, but in that huge mass
we have to separate the wheat from the chaff effectively.

When asked about the optimal size for a learning corpus, should we be
satisfied with the answer "The bigger, the better"?

Rather than training one LM on a gigantic learning corpus, would it
not be advisable to split this corpus into linguistically coherent
segments and learn several language models, whose scores might be
combined at test time (model mixture)?
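
As an illustration of the model mixture idea, here is a minimal Python
sketch that trains one unigram model per corpus segment and combines
their scores by linear interpolation. The toy segments, the add-alpha
smoothing and the fixed mixture weights are all assumptions made for
the example; they are not taken from this call.

```python
from collections import Counter

def train_unigram(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram model over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def mixture_prob(word, models, weights):
    """Linear interpolation: P(w) = sum_k lambda_k * P_k(w)."""
    return sum(lam * m[word] for lam, m in zip(weights, models))

# Two toy "linguistically coherent" segments (hypothetical data).
sports = "the team won the match".split()
finance = "the bank raised the rate".split()
vocab = sorted(set(sports) | set(finance))

# One model per segment, combined with assumed fixed weights.
models = [train_unigram(sports, vocab), train_unigram(finance, vocab)]
weights = [0.5, 0.5]

p_match = mixture_prob("match", models, weights)
total_mass = sum(mixture_prob(w, models, weights) for w in vocab)
```

Because each component is a proper distribution over the shared
vocabulary and the weights sum to one, the mixture is one as well. In
practice the weights would usually be estimated on held-out data
rather than fixed by hand.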

In the case of n-gram models, what is the optimal value for n? Should
it be fixed or variable?

A larger value allows us to capture linguistic constraints over a
context which goes beyond the mere two preceding words of the classic
trigram. However, increasing n threatens us with serious coverage
problems. What is the best trade-off between these two opposing
constraints?  How can we smooth models in order to approximate
phenomena that have not been learned? Which alternatives are to be
chosen, using which more general information (lower-order n-grams,
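
One common way of falling back on lower-order n-grams is linear
interpolation of trigram, bigram and unigram estimates. The Python
sketch below shows the idea under stated assumptions: the toy corpus,
the function names and the fixed weights are hypothetical, and other
smoothing schemes (back-off, discounting) would be equally valid
choices.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def context_counts(counts):
    """Sum n-gram counts over the last word, giving counts of each history."""
    ctx = Counter()
    for gram, c in counts.items():
        ctx[gram[:-1]] += c
    return ctx

def make_interpolated_trigram(tokens, lambdas=(0.6, 0.3, 0.1)):
    """Return prob(w, u, v), an interpolated estimate of P(w | u, v)."""
    uni, bi, tri = (ngram_counts(tokens, n) for n in (1, 2, 3))
    bi_ctx, tri_ctx = context_counts(bi), context_counts(tri)
    total = len(tokens)
    l3, l2, l1 = lambdas  # assumed fixed weights that sum to 1

    def prob(w, u, v):
        # Each term is a relative frequency; unseen histories contribute 0.
        p3 = tri[(u, v, w)] / tri_ctx[(u, v)] if tri_ctx[(u, v)] else 0.0
        p2 = bi[(v, w)] / bi_ctx[(v,)] if bi_ctx[(v,)] else 0.0
        p1 = uni[(w,)] / total
        return l3 * p3 + l2 * p2 + l1 * p1

    return prob

corpus = "the cat sat on the mat the cat ran".split()
prob = make_interpolated_trigram(corpus)
p_seen = prob("sat", "the", "cat")    # trigram observed in the corpus
p_unseen = prob("mat", "the", "cat")  # trigram never observed
```

The unseen trigram still receives a non-zero score through the unigram
term, which is the coverage benefit of smoothing. A word that never
occurs in the corpus would still score zero here, so
out-of-vocabulary words need separate treatment.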

Beyond the traditional opposition between numerical and
knowledge-based approaches, there is a consensus about the
introduction of rules into stochastic models, or probability into
grammars, hoping to get the best of both strategies. Hybrid models can
be conceived in several ways, depending on which choices are made
regarding both of their sides, and also, the place where coupling
occurs. Because of discrepancies between the language a grammar
generates, and actually observed syntagms, some researchers decided to
reverse the situation and derive the grammar from observed facts.
However, this method yields disappointing results, since it does not
perform any better than n-gram methods, and is perhaps
inferior. Should we not introduce a good deal of supervision here, if
we want to reach this goal?


Topics (non-exhaustive list)

In this special issue, we would like to publish either innovative
papers, or surveys and prospective essays dealing with Language Models
(LM) and the Automated Learning of their parameters, and covering one
of the following subtopics:

     Language Models and Resources:
         determination of the adequate lexicon
         determination of the adequate corpus
     Topical Models
     LM with fixed or variable history
     Probabilistic Grammars
     Grammatical Inference
     Hybrid Language Models
     Static and dynamic adaptation of LMs
     Dealing with the Unknown
         Modelling words which do not belong to the vocabulary
         Methods for smoothing LMs
     Supervised and unsupervised learning of LMs
         Automated classification of basic units
         Introducing linguistic knowledge into LMs
     Methods for LM learning
         EM, MMI, others?
     Evaluation of Language Models
     Complexity and LM theory

     - Speech Recognition
     - Machine Translation
     - Information Retrieval



Papers (25 pages maximum) are to be submitted in Word or LaTeX. Style
sheets are available at HERMES : < >.



Articles can be written either in French or in English, but articles
in English will be accepted only from non-French-speaking authors.



Submission deadline is October 7, 2002. Authors who plan to submit a
paper are invited to contact Michèle Jardino and / or Marc El-Beze
before September 15, 2002.

Articles will be reviewed by a member of the editorial board and two
external reviewers designated by the editors of this issue. Decisions
of the editorial board and the referees' reports will be sent to the
authors before November 20, 2002.

The final versions of accepted papers will be required by February
20, 2003. Publication is planned for the spring of 2003.



Submissions must be sent electronically to:

Michèle Jardino  ( jardino at )

Marc El-Bèze   ( marc.elbeze at )

or, in paper version (four copies), posted to:

Marc El-Beze Laboratoire d'Informatique
LIA - CERI BP 1228

Message distributed by the Langage Naturel list <LN at>
Information, subscription :
English version          :
Archives                 :

The LN list is sponsored by ATALA (Association pour le Traitement
Automatique des Langues)
Information and membership :
