[Corpora-List] Developing and testing new similarity measures for word clustering

Adam Kilgarriff adam at lexmasterclass.com
Sun Oct 10 09:15:10 UTC 2004


Normand,

There's a growing literature on thesaurus evaluation, with some
really interesting recent developments.  It starts from thesis work by
Sparck Jones (1960s), then Hindle (1990), Grefenstette (1994), Lillian Lee
(eg ACL 99), Dekang Lin (eg COLING 1998), and more recently (eg 2003-04)
James Curran (Edinburgh/Sydney, who did very extensive experimentation,
evaluating against Roget, WordNet, and other human-made thesauruses) and
Julie Weeds (Sussex; see also her work with David Weir, which presents a
nice theoretical analysis of various measures in terms of their
precision vs recall properties).
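
As a rough illustration of what "evaluating against a human-made
thesaurus" can look like, here is a minimal Python sketch.  It is not
the protocol used in any of the work cited above.  It assumes that, for
a given headword, you have a ranked list of nearest neighbours from your
similarity measure and a set of gold-standard synonyms from a resource
such as Roget or WordNet.  The function name and the toy data are
hypothetical.  Note that this scores an output list only; it is a
different thing from the Weeds and Weir analysis of the measures
themselves.

```python
def precision_recall_at_k(ranked_neighbours, gold_synonyms, k):
    """Score the top-k neighbours of one headword against a gold synonym set.

    ranked_neighbours: list of words, most similar first (hypothetical input).
    gold_synonyms: set of words the reference thesaurus lists for the headword.
    """
    top_k = ranked_neighbours[:k]
    hits = sum(1 for word in top_k if word in gold_synonyms)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(gold_synonyms) if gold_synonyms else 0.0
    return precision, recall


# Toy data, invented purely for illustration (headword: "car").
ranked = ["automobile", "truck", "vehicle", "road", "driver"]
gold = {"automobile", "auto", "motorcar", "vehicle"}

precision, recall = precision_recall_at_k(ranked, gold, k=3)
print(f"precision@3 = {precision:.2f}, recall@3 = {recall:.2f}")
```

Averaging these scores over many headwords gives one number per
similarity measure, which makes measures directly comparable on the same
gold standard.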

Sorry if you knew all this already.

Adam

-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On Behalf Of Normand Peladeau
Sent: 08 October 2004 13:47
To: CORPORA at uib.no
Subject: [Corpora-List] Developing and testing new similarity measures for word clustering

I have been reviewing some of the similarity measures used to perform
word clustering (Jaccard, Dice, Simple Matching, correlation, etc.) and
I came to the conclusion that many of those measures have metric
problems that probably make them non-optimal for word clustering.
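
For reference, the three binary measures named above can be sketched in
Python as follows.  This uses the standard textbook definitions over two
equal-length presence/absence vectors (for example, whether each word
occurs in each document).  It is not the modified indices discussed in
this message, and the example vectors are invented.

```python
def contingency(x, y):
    """Return the 2x2 counts (a, b, c, d) for two equal-length binary vectors.

    a: both present, b: only x present, c: only y present, d: both absent.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    a = sum(1 for p, q in zip(x, y) if p and q)
    b = sum(1 for p, q in zip(x, y) if p and not q)
    c = sum(1 for p, q in zip(x, y) if not p and q)
    d = sum(1 for p, q in zip(x, y) if not p and not q)
    return a, b, c, d


def jaccard(x, y):
    a, b, c, _ = contingency(x, y)
    return a / (a + b + c) if (a + b + c) else 0.0


def dice(x, y):
    a, b, c, _ = contingency(x, y)
    return 2 * a / (2 * a + b + c) if (2 * a + b + c) else 0.0


def simple_matching(x, y):
    a, b, c, d = contingency(x, y)
    n = a + b + c + d
    return (a + d) / n if n else 0.0


# Two hypothetical words observed across eight documents (1 = word occurs).
word_x = [1, 1, 0, 0, 0, 0, 0, 0]
word_y = [1, 0, 1, 0, 0, 0, 0, 0]

for name, measure in [("Jaccard", jaccard), ("Dice", dice),
                      ("Simple Matching", simple_matching)]:
    print(f"{name}: {measure(word_x, word_y):.3f}")
```

On sparse data like this, Simple Matching is the only one of the three
that counts joint absences (d), so it comes out well above Jaccard and
Dice for the same pair of words.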

I am working now on some modified versions of those indices and I need
some ways to benchmark those new similarity measures.  I would like to
have a series of benchmarks for several kinds of application (dimension
reduction, automatic identification of themes, automatic taxonomy
development, etc.).

I would like suggestions for ways to benchmark those new measures and
compare their performance with the more traditional ones.  Any ideas,
references, or data sets would be welcome.

I am also looking for existing articles where those measures have been
compared (either empirically or theoretically).


Thanks,

Normand Peladeau
Provalis Research


