Call: Journal of Natural Language Engineering, Special Issue on Machine Translation Using Comparable Corpora

Thierry Hamon hamon at LIMSI.FR
Sun Jun 22 19:21:24 UTC 2014

Date: Fri, 20 Jun 2014 12:30:33 +0200
From: "Reinhard Rapp" <reinhardrapp at>
Message-ID: <45AAADB6E08D4114A33C67C5B869DAD2 at ASUSPC>

***** Journal of Natural Language Engineering - Special Issue on
      “Machine Translation Using Comparable Corpora” *****


Statistical machine translation based on parallel corpora has been very
successful. The major search engines' translation systems, which are
used by millions of people, primarily use this approach, and it has
been possible to add new language pairs in a fraction of the time that
would be required with more traditional rule-based approaches.

In contrast, research on comparable corpora is still at an earlier
stage. Comparable corpora can be defined as monolingual corpora covering
roughly the same subject area in different languages but without being
exact translations of each other.

However, despite its tremendous success, the use of parallel corpora in
MT has a number of drawbacks:

1) It has been shown that translated language is somewhat different from
   original language; for example, Klebanov & Flor showed that
   "associative texture" is lost in translation.

2) As they require translation, parallel corpora will always be a far
   scarcer resource than comparable corpora. This is a severe drawback
   for a number of reasons:

a) Among the roughly 7000 languages of the world, of which 600 have a
   written form, the vast majority are of the "low resource" type.

b) The number of possible language pairs increases with the square of
   the number of languages. When using parallel corpora, one bitext is
   needed for each language pair. When using comparable corpora, one
   monolingual corpus per language suffices.

c) For improved translation quality, translation systems specialized for
   particular genres and domains are desirable. But it is far more
   difficult to acquire appropriate parallel training corpora than
   comparable ones.

d) As language evolves over time, the training corpora should be updated
   on a regular basis. Again, this is more difficult in the parallel
   case.
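
To make the arithmetic behind point b) concrete, here is a minimal
sketch in Python. It assumes that one bitext serves both translation
directions, so that language pairs are counted as unordered pairs; the
function name corpora_needed is purely illustrative, and the value 600
is simply the number of written languages quoted in point a).

```python
from math import comb


def corpora_needed(num_languages):
    """Return (bitexts, monolingual_corpora) needed to cover every pair.

    Assumption: one bitext per unordered language pair, versus one
    monolingual corpus per language in the comparable-corpora setting.
    """
    bitexts = comb(num_languages, 2)  # n * (n - 1) / 2 unordered pairs
    monolingual_corpora = num_languages
    return bitexts, monolingual_corpora


for n in (10, 100, 600):
    bitexts, mono = corpora_needed(n)
    print(f"{n} languages: {bitexts} bitexts vs. {mono} monolingual corpora")
```

Under these assumptions, 600 languages would require 179700 bitexts
but only 600 monolingual corpora.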

For such reasons it would be a big step forward if it were possible to
base statistical machine translation on comparable rather than on
parallel corpora: The acquisition of training data would be far easier,
and the unnatural "translation bias" (source language shining through)
within the training data could be avoided.

But is there any evidence that this is possible? Motivation for using
comparable corpora in MT research comes from a cognitive perspective:
Experience shows that people who have learned a second language
completely independently from their mother tongue can nevertheless
translate between the languages. That is, human performance shows that
there must be a way to bridge the gap between languages which does not
rely on parallel data. Using parallel data for MT is of course a nice
shortcut. But avoiding this shortcut by doing MT based on comparable
corpora may well be a key to a better understanding of human
translation, and to better MT quality.

Work on comparable corpora in the context of MT has been ongoing for
almost 20 years. It has turned out that this is a very hard problem to
solve, but as it is among the grand challenges in multilingual NLP,
interest has steadily increased. Apart from the increase in publications
this can be seen from the considerable number of research projects (such
as ACCURAT and TTC) which are fully or partially devoted to MT using
comparable corpora. Given also the success of the workshop series on
“Building and Using Comparable Corpora” (BUCC), which is now in its
seventh year, and following the publication of a related book, we think
that it is now time to devote a journal special issue to this field. It
is meant to bundle the latest top-class research, make it available to
everybody working in the field, and at the same time give an overview of
the state of the art to all interested researchers.


We solicit contributions including but not limited to the following
topics:

- Comparable corpora based MT systems (CCMTs)
- Architectures for CCMTs
- CCMTs for less-resourced languages
- CCMTs for less-resourced domains
- CCMTs dealing with morphologically rich languages
- CCMTs for spoken translation
- Applications of CCMTs
- CCMT evaluation
- Open source CCMT systems
- Hybrid systems combining SMT and CCMT
- Hybrid systems combining rule-based MT and CCMT 
- Enhancing phrase-based SMT using comparable corpora
- Expanding phrase tables using comparable corpora
- Comparable corpora based processing tools/kits for MT
- Methods for mining comparable corpora from the Web
- Applying Harris' distributional hypothesis to comparable corpora
- Induction of morphological, grammatical, and translation rules from
  comparable corpora
- Machine learning techniques using comparable corpora
- Parallel corpora vs. pairs of non-parallel monolingual corpora
- Extraction of parallel segments or paraphrases from comparable corpora
- Extraction of bilingual and multilingual translations of single words
  and multi-word expressions, proper names, and named entities from
  comparable corpora

December 1, 2014: Paper submission deadline
February 1, 2015: Notification
May 1, 2015: Deadline for revised papers
July 1, 2015: Final notification
September 1, 2015: Final paper due


Reinhard Rapp, Universities of Aix Marseille (France) and Mainz (Germany)
Serge Sharoff, University of Leeds (UK)
Pierre Zweigenbaum, LIMSI, CNRS (France)


Please use the following e-mail address to contact the guest editors:
jnle.bucc (at) limsi (dot) fr

Further details on paper submission will be made available in due course
at the BUCC website:
