Seminar: Alpage, Darja Fiser, Bilingual lexicon extraction from comparable corpora, 4 February 2013

Thierry Hamon thierry.hamon at UNIV-PARIS13.FR
Sun Feb 3 14:37:39 UTC 2013

Date: Fri, 1 Feb 2013 09:17:06 +0100
From: Marie Candito <marie.candito at>
Message-ID: <CAKCM-9HZceotaCs_oTstX41HkC80X8fTfryn9jPzF+VUMo-XDw at>

************** Alpage Team Seminar *******************

This is the research seminar in computational linguistics
organized by the Alpage team, a joint INRIA - Paris Diderot team
specializing in automatic syntactic parsing and in the processing of

The next seminar will exceptionally take place on a *Monday:*

* Monday 4 February, 11:00 to 12:30 *

in room 3E91 at the UFRL, 16, rue Clisson, 75013 Paris
(3rd floor, left)

Anyone interested is welcome.


Darja Fišer, University of Ljubljana

No dictionary? No problem!
(Bilingual lexicon extraction from comparable corpora for closely
related languages)

In smaller language communities, such as Slovenia, most funds and
efforts for bilingual dictionary creation are usually limited to a few
world languages, such as English, German, French and Italian. But very
often there is a need, either among researchers in natural language
processing or among foreign language students, for a dictionary between
less marketable language combinations as well. In this talk we propose
an attempt to close this gap economically and efficiently for a pair of
closely related languages, such as Slovene and Croatian, solely by
relying on the nonparallel corpus resources that are readily available
or can be constructed quickly and with relatively little effort. We will
show that we can build context vectors from web corpora for each
language, and then compare the vectors in the two different languages by
taking advantage of the language similarities, in order to identify
which word pairs in the two languages are contextually most
similar. Typically, a seed dictionary is needed to translate the
features of source context vectors into the target language, but because
our languages are very similar, we tackle the task by using identical
words and cognates only. In addition, we try to improve the results by
extending the seed dictionary with automatically extracted translation
pairs for the most frequent words in the corpus, whose quality is known
to be very good. Finally, we perform cognate-based
reranking of the list of top 10 translation candidates, which improves
the results even further. Currently, we are trying to adapt the approach
to be able to identify false friends, which are words that look
orthographically the same in both languages but are used to denote
different concepts and are therefore used in completely different
contexts. False friends matter in language learning because they are a
common source of errors even among advanced learners, and a list of them
would also be a welcome resource in a number of HLT tasks. The presented
approach is knowledge-light, requiring only minimal language processing
tools, such as lemmatization and PoS-tagging, and could therefore be
applied to any combination of closely related languages and corpora from
any domain.
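
The pipeline described in the abstract can be illustrated with a small
sketch. This is not the speaker's implementation: the function names,
the window size, the raw co-occurrence counts, the cosine measure and the
difflib string similarity are all assumptions chosen for brevity, and the
two "corpora" are a handful of invented lemmatized sentences. The seed
dictionary here contains identical words only; the abstract also uses
cognates.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` tokens of it."""
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            ctx = vectors.setdefault(word, Counter())
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    ctx[tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def translate_features(vector, seed):
    """Keep only the features covered by the seed dictionary, mapped to the target language."""
    translated = Counter()
    for feature, count in vector.items():
        if feature in seed:
            translated[seed[feature]] += count
    return translated


def string_similarity(a, b):
    """Stand-in cognate measure (assumption): difflib's matching-block ratio."""
    return SequenceMatcher(None, a, b).ratio()


def candidates(word, src_vecs, tgt_vecs, seed, top_n=10):
    """Rank target words by context similarity, then rerank the top_n by string similarity."""
    src = translate_features(src_vecs[word], seed)
    scored = sorted(((cosine(src, vec), tgt) for tgt, vec in tgt_vecs.items()), reverse=True)
    top = [tgt for score, tgt in scored[:top_n] if score > 0]
    return sorted(top, key=lambda tgt: string_similarity(word, tgt), reverse=True)


# Toy lemmatized sentences (invented for illustration, not real corpus data).
slovene = [["pes", "piti", "voda"], ["pes", "jesti", "kruh"], ["otrok", "piti", "mleko"]]
croatian = [["pas", "piti", "voda"], ["pas", "jesti", "kruh"], ["dijete", "piti", "mlijeko"]]

sl_vecs = context_vectors(slovene)
hr_vecs = context_vectors(croatian)

# Seed dictionary built from words spelled identically in both corpora.
seed = {w: w for w in sl_vecs if w in hr_vecs}

print(candidates("pes", sl_vecs, hr_vecs, seed))
print(candidates("mleko", sl_vecs, hr_vecs, seed))
```

On this toy data the first candidate for "pes" is "pas" and the first
candidate for "mleko" is "mlijeko". For "mleko", several target words tie
on context similarity alone, and the string-similarity reranking is what
moves "mlijeko" to the top.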

Message distributed by the Langage Naturel list <LN at>

The LN list is sponsored by ATALA (Association pour le Traitement
Automatique des Langues)
