Seminar: S. Bandyopadhyay, Machine learning methods for named entity recognition, LIPN, 16 October 2007

Thierry Hamon thierry.hamon at LIPN.UNIV-PARIS13.FR
Wed Oct 10 07:35:02 UTC 2007

Date: Tue, 09 Oct 2007 11:45:49 +0200
From: Thierry.Poibeau at
Message-ID: <20071009114549.jxec5i2tcf4088o0 at>


The Laboratoire d'Informatique de Paris-Nord (LIPN) will host Sivaji
Bandyopadhyay (Jadavpur University, India) on Tuesday 16 October,
starting at 2 p.m., for a seminar on the use of machine learning
methods for named entity recognition. The entity recognition system
is then used for various language processing tasks, such as machine
translation or multi-document summarization (see abstract below).

The seminar will take place on Tuesday 16 October in room B311 at
LIPN, starting at 2 p.m. Directions to LIPN:

Sivaji Bandyopadhyay is invited as part of the STIC-Asie exchange
programme coordinated by Patrick Saint-Dizier (IRIT, Toulouse).

Thierry Poibeau


Abstract: Named Entity Recognition, Transliteration and Use in MT
(Machine Translation), TDT (Topic Detection and Tracking) and MDS
(Multi-Document Summarization)

The current trend in NER is to use the machine-learning approach,
which is attractive because it is trainable and adaptable, and because
a machine-learning system is much cheaper to maintain than a
rule-based one. We have developed Named Entity Recognition (NER)
systems for Bengali using various techniques: a pattern-directed
shallow parsing approach, with and without linguistic knowledge, a
statistical Hidden Markov Model (HMM), a Maximum Entropy (ME) model,
Conditional Random Fields (CRF) and a Support Vector Machine (SVM).
Named Entity Recognition in Indian languages (ILs), particularly in
Bengali, is difficult and challenging because, unlike English, ILs
have no concept of capitalization. A web-based tagged Bengali news
corpus of approximately 34 million wordforms in UTF-8 has been
developed from the web archive of a leading Bengali newspaper, and a
part of this corpus has been used in the NER tasks. All the systems
have been evaluated, and the SVM-based model has outperformed the
others with an overall F-Score of 91.8%.

We have used a modified joint source-channel model for named entity
transliteration, and this has been applied to transliteration between
English and Bengali. We are using the named entity tags for English
named entities in an English-Bengali machine translation system. We
have recently started work on Story Link Detection, in which each news
story is represented as a collection of four vectors: locations,
proper names, temporal expressions and general terms. The 4-vector
representation of each news document will be used to measure the
similarity between two news documents. The 4-vector representation of
news stories and the similarity measure between news stories can be
used further towards multi-document summarization of news stories.
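
The 4-vector representation described above can be illustrated with a
short sketch. The abstract does not say which similarity measure is
used or how the four vectors are combined, so everything here is an
assumption made for illustration: per-vector cosine similarity over
term counts, an equally weighted average, and the hypothetical names
FIELDS, cosine and story_similarity.

```python
import math
from collections import Counter

# The four vectors named in the abstract. The dictionary keys are
# hypothetical labels chosen for this sketch.
FIELDS = ("locations", "proper_names", "temporal_expressions", "general_terms")


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as sharing nothing
    return dot / (norm_a * norm_b)


def story_similarity(story_a, story_b, weights=None):
    """Weighted average of the per-field cosine similarities.

    Each story is a dict mapping a field name to a list of terms.
    Equal weights are an assumption, not taken from the abstract.
    """
    if weights is None:
        weights = {field: 1.0 / len(FIELDS) for field in FIELDS}
    total = 0.0
    for field in FIELDS:
        vec_a = Counter(story_a.get(field, []))
        vec_b = Counter(story_b.get(field, []))
        total += weights[field] * cosine(vec_a, vec_b)
    return total


# Toy stories, invented purely for illustration.
story_1 = {
    "locations": ["Kolkata"],
    "proper_names": ["Jadavpur University"],
    "temporal_expressions": ["Tuesday"],
    "general_terms": ["seminar", "language"],
}
story_2 = {
    "locations": ["Paris"],
    "proper_names": ["LIPN"],
    "temporal_expressions": ["October"],
    "general_terms": ["workshop", "translation"],
}
```

Under these assumptions, story_similarity(story_1, story_1) is 1 up to
floating-point rounding, and story_similarity(story_1, story_2) is 0.0
because the two toy stories share no terms in any field. Keeping the
four vectors separate lets a link detector weight, say, shared
locations differently from shared general terms.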

Message distributed via the Langage Naturel list <LN at>

The LN list is sponsored by ATALA (Association pour le Traitement
Automatique des Langues)