Seminar: K. Gabor (Linguistics Inst. of Budapest), LIPN, feature selection for verb clustering

Thierry Hamon thierry.hamon at LIPN.UNIV-PARIS13.FR
Fri Nov 7 15:12:53 UTC 2008

Date: Fri, 07 Nov 2008 11:37:28 +0100
From: Antoine Rozenknop <Antoine.Rozenknop at>
Message-ID: <49141A68.4010404 at>

LIPN will host Kata Gábor, of the Linguistics Institute of Budapest,
for a seminar on clustering techniques applied to linguistic data.

The seminar will take place at LIPN (Univ. Paris 13, Villetaneuse campus)
in room B311, on WEDNESDAY 12 NOVEMBER 2008 at 2 p.m.

For directions to LIPN, see the information at:

Antoine Rozenknop

---
Notes:

The techniques presented aim to produce coherent lists of lexical
items (here, verbs). This matters for a number of language
engineering applications, such as question answering, information
extraction, and machine translation. The talk will show examples
drawn from Hungarian, but it is potentially relevant to other
languages, including, of course, French.

------------------------
Title: Feature Selection for Semantic Clustering of Hungarian Verbs
------------------------

Summary:

The presentation will focus on current issues in semantic verb
classification, with a particular emphasis on the choice of the
feature set which represents verbal syntactic distribution.
Experiments on supervised semantic verb classification generally aim
at obtaining verb classes equivalent to Beth Levin's English verb
classification or WordNet synsets. For languages which lack such
hand-made resources, verb classes can be obtained by unsupervised
clustering. In either case, the underlying hypothesis is that
syntactically similar verbs share one or more meaning components, and
an adequate representation of verbs' distributional context is crucial
to the success of the experiment.
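
As a rough illustration of this hypothesis (not the speaker's actual
setup), each verb can be represented by counts of the complementation
patterns it occurs with, and two verbs compared by cosine similarity.
The verb names, case labels, and counts below are invented toy values;
cosine similarity is only one plausible choice of measure.

```python
from collections import Counter
from math import sqrt

# Invented toy counts (not real corpus data): for each verb, how often it
# occurs with each complementation pattern (a tuple of case labels).
PATTERN_COUNTS = {
    "verb_a": Counter({("NOM", "ACC"): 40, ("NOM", "ACC", "DAT"): 25, ("NOM",): 5}),
    "verb_b": Counter({("NOM", "ACC"): 30, ("NOM", "ACC", "DAT"): 20}),
    "verb_c": Counter({("NOM",): 50, ("NOM", "INE"): 15}),
}


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# verb_a and verb_b share most of their patterns; verb_a and verb_c do not.
sim_ab = cosine(PATTERN_COUNTS["verb_a"], PATTERN_COUNTS["verb_b"])
sim_ac = cosine(PATTERN_COUNTS["verb_a"], PATTERN_COUNTS["verb_c"])
print(f"a~b: {sim_ab:.3f}  a~c: {sim_ac:.3f}")
```

A clustering algorithm would then group verbs whose pattern vectors
are close under such a measure.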

The first experiments for clustering Hungarian verbs were carried out
using data from a manually annotated treebank. The 150 most frequent
verbs were categorized according to the complementation patterns they
exhibit in the treebank. There was no limit on the length of the
patterns, and adjuncts were also included. As Hungarian is a highly
inflective language (with 19 different case suffixes), a huge quantity
of different patterns, i.e., a large feature set, was used. Despite the
fact that the large number of features scatters frequency data, the
results were promising. However, in order to extend the clustering to
less frequent verbs, one needs a bigger and, consequently,
automatically parsed corpus. As the parser introduces more noise,
which is an even more sensitive issue when dealing with medium or low
frequency verbs, the question arises of how the feature set can be
tuned in order to achieve a more precise description of the syntactic
distribution of verbs.
An obvious solution would be to filter complementation patterns
according to their frequency, but this could yield misleading results
for low frequency verbs. A more sophisticated method would be to use a
manually built verbal valency dictionary to filter out longer and/or
less relevant distributional contexts. On the other hand, this method
would imply losing the information carried by adjuncts (e.g. temporal
adjuncts reveal aspectual properties of verbs).
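
The trade-off between the two filtering strategies can be sketched on
toy data. Everything below is an assumption made for illustration: the
verb names, case labels, counts, threshold, and valency entries are
invented, and reducing a pattern to its dictionary-licensed cases is
only one possible reading of "filtering with a valency dictionary".

```python
from collections import Counter

# Invented toy counts (not real corpus data).
COUNTS = {
    "frequent_verb": Counter({("NOM", "ACC"): 120, ("NOM", "ACC", "INS"): 3}),
    "rare_verb": Counter({("NOM", "SUB"): 4, ("NOM", "SUB", "TEMP_ADJ"): 2}),
}

# Hypothetical valency dictionary: the cases each verb licenses as arguments.
VALENCY = {
    "frequent_verb": {"NOM", "ACC"},
    "rare_verb": {"NOM", "SUB"},
}


def filter_by_frequency(counts, min_count):
    """Keep only patterns whose corpus-wide count reaches min_count."""
    totals = Counter()
    for verb_counts in counts.values():
        totals.update(verb_counts)
    kept = {p for p, n in totals.items() if n >= min_count}
    return {
        verb: Counter({p: n for p, n in c.items() if p in kept})
        for verb, c in counts.items()
    }


def filter_by_valency(counts, valency):
    """Reduce each pattern to the cases licensed by the verb's valency entry."""
    result = {}
    for verb, c in counts.items():
        allowed = valency.get(verb, set())
        reduced = Counter()
        for pattern, n in c.items():
            core = tuple(case for case in pattern if case in allowed)
            reduced[core] += n
        result[verb] = reduced
    return result


by_freq = filter_by_frequency(COUNTS, min_count=5)
by_val = filter_by_valency(COUNTS, VALENCY)
print(by_freq["rare_verb"])  # the rare verb loses all of its features
print(by_val["rare_verb"])   # features survive, but the adjunct is gone
```

With a global threshold the rare verb is left with no features at all,
while the dictionary-based filter keeps its core pattern but discards
the temporal adjunct.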

After presenting the results of the first clustering experiment, I
will outline future research directions with respect to feature set
reduction, and discuss the advantages and disadvantages of the
particular methods.

----
RCLN seminars:
----

Message distributed by the Langage Naturel list <LN at>
Information, subscription:
English version       : 
Archives                 :

The LN list is sponsored by ATALA (Association pour le Traitement
Automatique des Langues)
Information and membership:
