[Corpora-List] Faster tool for WordNet Similarity measures

Wed Feb 2 21:22:00 UTC 2011

Hello Suzan and everyone else,

Perhaps the problem you are working on requires WordNet senses, but if it does not, you might also consider Open Roget's Thesaurus (http://rogets.site.uottawa.ca/).  It is a Java API we have developed for the 1911 version of Roget's Thesaurus.  Much like the WordNet Similarity module, it requires an index to be loaded, but once that is done we have found it to be much faster at measuring relatedness between pairs of words.

The improved speed comes from the fact that in Roget's Thesaurus all words appear at the bottom of its 9-level hierarchy, so every word sense can be represented by a set of 9 numbers.  As such, the path distance between two word senses can be calculated in constant time after looking up the words in a hash table.  Even when the words have many senses, the semantic distance is very quick to calculate.
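
To make the idea concrete, here is a rough sketch in Python.  It is not the Open Roget's Java API: the toy index, the function names, and the edge-counting score are all assumptions made for this illustration, and the numbers in the index are invented rather than taken from the Thesaurus.  Each sense is a tuple of 9 numbers, one per level, and the distance is the number of edges on the path between two leaves.

```python
from itertools import product

DEPTH = 9  # number of levels in the hierarchy, as described above

# Hypothetical toy index: word -> list of senses, each a 9-tuple of node numbers.
# These values are made up for illustration; they are not real Roget's data.
INDEX = {
    "car": [(5, 2, 1, 40, 272, 1, 3, 2, 7)],
    "bus": [(5, 2, 1, 40, 272, 1, 3, 4, 1)],
    "bank": [(5, 3, 2, 61, 802, 1, 1, 1, 2), (2, 1, 4, 18, 217, 1, 2, 1, 5)],
}


def shared_prefix(a, b):
    """Count how many leading levels two senses have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def sense_distance(a, b):
    """Edges between two leaves of equal depth: up to the common node, then down."""
    return 2 * (DEPTH - shared_prefix(a, b))


def word_distance(w1, w2, index=INDEX):
    """Minimum distance over all sense pairs; None if either word is not indexed."""
    senses1, senses2 = index.get(w1), index.get(w2)
    if not senses1 or not senses2:
        return None
    return min(sense_distance(a, b) for a, b in product(senses1, senses2))
```

With this toy index, word_distance("car", "bus") gives 4, because the two senses share their first 7 levels.  Each sense comparison touches at most 9 numbers, so the cost per sense pair does not grow with the size of the lexicon.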

Our experiments on the 353sim dataset showed Open Roget's to be comparable to many of the WordNet-based measures (though not quite as good as Lesk) in terms of correlation, while running in a fraction of the time.  Of course, its lexicon is now quite dated; however, it may be suitable, depending on the problem it is applied to.

Alistair Kennedy
PhD Candidate
University of Ottawa

On 2011-02-02, at 9:03 AM, Eneko Agirre wrote:

> 
> Hi Suzan, all,
> 
> another option is to use UKB for word similarity / relatedness (http://ixa2.si.ehu.es/ukb/). It's based on random walks over knowledge base graphs, and it has produced the best WordNet-based results on the 353sim dataset to date (as reported in several papers which you can find on the website). The random walk software is written in C++, and the similarity / relatedness code in Perl.
> 
> The random walks are the most costly part of the process, so we have precomputed random walks for all WordNet lemmas (available on the website, 1.2 G), and thus the similarity/relatedness algorithm just needs to do a vector comparison. To improve speed further, the precomputed vectors contain 1000 components (instead of the ca. 120000 in the full WordNet graph). The results on the 353sim dataset using 1000 components or the full vectors were nearly identical.
> 
> best
> 
> eneko
> 
>> Date: Tue, 1 Feb 2011 10:25:23 +0100
>> From: Suzan Verberne <s.verberne at let.ru.nl>
>> Subject: [Corpora-List] Faster tool for WordNet Similarity measures
>> To: Corpora List <corpora at uib.no>
>> 
>> Hi all,
>> 
>> I have previously been using Pedersen's WordNet Similarity module (
>> http://search.cpan.org/dist/WordNet-Similarity/lib/WordNet/Similarity.pm
>> ) for calculating the similarity or relatedness between pairs of
>> words. I have now started using it again, but I noticed that it is far
>> too slow for a real-time application (which is what I need now).
>> 
>> I originally wrote a simple Perl script that calls the module (shown
>> below) but it takes almost five seconds to run. Almost all this time
>> is spent on calling the module so for batch scripts it is fine (then
>> the module is only called once for multiple requests), but I need it
>> to work in real time in a retrieval experiment and then 5 seconds is
>> too long.
>> 
>> Does anyone know an alternative (fast!) tool for calculating
>> Similarity and/or Relatedness between two words? It might be using
>> either a Wu & Palmer-like measure or a Lesk-type measure.
>> 
>> Thanks!
>> Suzan Verberne
>> 
>> #!/usr/bin/perl
>>  use strict;
>>  use warnings;
>>  use WordNet::QueryData;
>>  use WordNet::Similarity::path;
>>  # Constructing the QueryData object loads the WordNet index files
>>  my $wn = WordNet::QueryData->new;
>>  my $measure = WordNet::Similarity::path->new ($wn);
>>  # Arguments are word#pos#sense strings
>>  my $value = $measure->getRelatedness("car#n#1", "bus#n#2");
>>  print "car (sense 1) <-> bus (sense 2) = $value\n";
>> 
>> 
>> -- 
>> Suzan Verberne, postdoctoral researcher
>> Centre for Language and Speech Technology
>> Radboud University Nijmegen
>> Tel: +31 24 3611134
>> Email: s.verberne at let.ru.nl
>> http://lands.let.ru.nl/~sverbern/
>> --
>> 
>> _______________________________________________
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
> 
> 
> -- 
> ----------- NEW URL: http://ixa2.si.ehu.es/eneko ------------
> 
> Eneko Agirre                                                .
> Informatika Fakultatea                mailto: e.agirre at ehu.es
> Manuel Lardizabal, 1                                        .
> 20.018 Donostia                         fax: (+34) 943 015590
> Basque Country (via Spain)              tel: (+34) 943 015019
> 
> 
