LINGUIST List: Vol-18-2229. Tue Jul 24 2007. ISSN: 1068-4875.

Subject: 18.2229, Qs: Semantic Similarity Experiment

Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>
 
Reviews: Randall Eggert, U of Utah  
         <reviews at linguistlist.org> 

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, 
and donations from subscribers and publishers.

Editor for this issue: Dan Parker <dan at linguistlist.org>
================================================================  

We'd like to remind readers that the responses to queries are usually
best posted to the individual asking the question. That individual is
then strongly encouraged to post a summary to the list. This policy was
instituted to help control the huge volume of mail on LINGUIST; so we
would appreciate your cooperating with it whenever it seems appropriate.

In addition to posting a summary, we'd like to remind people that it
is usually a good idea to personally thank those individuals who have
taken the trouble to respond to the query.

To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.html.

===========================Directory==============================  

1)
Date: 24-Jul-2007
From: Nuno Seco <nseco at dei.uc.pt>
Subject: Semantic Similarity Experiment

-------------------------Message 1 ---------------------------------- 
Date: Tue, 24 Jul 2007 20:14:24
From: Nuno Seco [nseco at dei.uc.pt]
Subject: Semantic Similarity Experiment
  


In the context of a joint research project, we are asking fellow researchers
to contribute about 10 minutes of their time and collaborate in an experiment
that (we hope) will help us gather a large dataset of similarity ratings
for pairs of words. Participation is quite simple, so if you are interested,
please read the section HOW TO PARTICIPATE. If you want to learn more about
the experiment, please read the section INTRODUCTION.

Thanks in advance,

Giuseppe Pirrò & Nuno Seco


Introduction:

Semantic similarity plays an important role in Information Retrieval,
Natural Language Processing, Ontology Mapping and other related fields of
research.

In particular, researchers have developed a variety of semantic similarity
and relatedness measures by exploiting information found in lexical
resources such as WordNet. Current similarity metrics based on WordNet can
be classified into one of the following categories:

Edge-Counting measures that are based on the number of links relating two
concepts that are being compared.

Information Content measures that are based on the idea that the similarity
of two concepts is related to the amount of information they have in common.

Feature-Based measures that exploit the features (e.g., descriptions in
natural language) of a term while usually ignoring their location in the
taxonomy.

Hybrid measures that combine ideas from the previous categories. (A short
code sketch of the first two categories follows this list.)
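
To make the first two categories concrete, here is a minimal Python sketch
using NLTK's WordNet interface (NLTK and the chosen synsets are only an
illustrative assumption; the measures themselves are toolkit-independent):

# Requires the NLTK 'wordnet' and 'wordnet_ic' data packages.
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

gem = wn.synset('gem.n.01')
jewel = wn.synset('jewel.n.01')

# Edge-Counting: score derived from the shortest path linking the two
# concepts in the hypernym taxonomy, i.e. 1 / (path length + 1).
print(gem.path_similarity(jewel))

# Information Content: Lin similarity relates the IC of the lowest
# common subsumer to the ICs of the two concepts themselves.
brown_ic = wordnet_ic.ic('ic-brown.dat')  # IC counts from the Brown corpus
print(gem.lin_similarity(jewel, brown_ic))

Note that the Lin measure needs corpus-derived IC counts; the intrinsic
alternative discussed below removes that dependency.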

In order to evaluate the suitability of the various similarity measures,
they are usually compared against human judgements by calculating
correlation values. A typical evaluation reference is the results of the
Rubenstein and Goodenough (R&G) experiment. In 1965, R&G obtained
''synonymy judgements'' from 51 human subjects on 65 pairs of words. The
pairs ranged from ''highly synonymous'' (gem-jewel) to ''semantically
unrelated'' (noon-string). Subjects were asked to rate them on a scale of
0.0 to 4.0 according to their ''similarity of meaning'', ignoring any
other observed semantic relationships.
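
The comparison itself is a simple correlation computation. A minimal
sketch, with placeholder numbers rather than actual R&G data:

from scipy.stats import pearsonr, spearmanr

human_ratings = [3.9, 3.0, 1.2, 0.1]       # hypothetical 0.0-4.0 judgements
measure_scores = [0.96, 0.81, 0.35, 0.02]  # hypothetical measure outputs

r, _ = pearsonr(human_ratings, measure_scores)     # linear correlation
rho, _ = spearmanr(human_ratings, measure_scores)  # rank correlation
print('Pearson r = %.3f, Spearman rho = %.3f' % (r, rho))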

Although other similar experiments have been carried out since the R&G
experiment, we are not aware of similarity experiments aimed at showing
how robust the different measures are when compared across different
versions of WordNet. With this objective in mind, we want to collect human
similarity estimations on the whole Rubenstein and Goodenough dataset and
subsequently compare the outputs of existing similarity measures. We chose
to adopt the R&G dataset because others have worked on it, thus permitting
direct comparison of results obtained in different experiments.

Moreover, we want to show the suitability of an Information Content metric
that relies solely on the WordNet taxonomy, without requiring an external
collection of texts.
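
One published formulation of such a taxonomy-only (''intrinsic'') IC is
that of Seco, Veale and Hayes (2004), in which a concept with many hyponyms
carries little information. A minimal sketch, assuming that formulation:

import math
from nltk.corpus import wordnet as wn

N = len(list(wn.all_synsets('n')))  # size of the noun taxonomy

def intrinsic_ic(synset):
    # IC(c) = 1 - log(hypo(c) + 1) / log(N); uses WordNet alone,
    # with no external corpus counts.
    hypo = len(list(synset.closure(lambda s: s.hyponyms())))
    return 1.0 - math.log(hypo + 1) / math.log(N)

print(intrinsic_ic(wn.synset('entity.n.01')))  # very general: IC near 0
print(intrinsic_ic(wn.synset('jewel.n.01')))   # more specific: IC near 1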


How to participate:

In order to participate in the similarity experiment, point your browser to:
http://grid.deis.unical.it/similarity/

Then, by clicking on the register link, you can register and immediately
receive a password via email.

After logging in, you should indicate similarity values for all the word
pairs by using the slider provided for each pair. The estimated time
required is about 10 minutes, including registration.

The results of the experiment and the data will be published as soon as we
have collected a significant number of ratings.

Linguistic Field(s): Computational Linguistics
                     Semantics





-----------------------------------------------------------
LINGUIST List: Vol-18-2229