[Corpora-List] Faster tool for WordNet Similarity measures

Nitin Madnani nmadnani at gmail.com
Tue Feb 1 18:20:19 UTC 2011


Ted,

I agree that rewriting in Python is not likely to provide much of a
speedup. I have contributed to NLTK's (Python-based) WordNet similarity
implementation (which was inspired by WordNet::Similarity), and it
isn't much faster than the Perl version.

I think that pre-computing is actually an interesting idea. It might
be possible to pre-compute the similarity matrices using a combination
of Hadoop and EC2.

I concur that parallel computation (either via a server or, again,
via Hadoop+EC2) is definitely a relatively easy solution. I wasn't
aware of similarity_server.pl, but I have, in the past, used a
homegrown XML-RPC server to offset the start-up cost of similarity
computation.

Nitin

On Tue, Feb 1, 2011 at 1:06 PM, Ted Pedersen <tpederse at d.umn.edu> wrote:
> Hi Suzan,
>
> If you locate such an alternative I'm absolutely dying to know. :)
>
> It would also be helpful to know how fast would be considered *fast*
> or fast enough.  I did a quick little benchmark using 10,000 randomly
> generated noun pairs (using our randomPairs.pl program from
> http://www.d.umn.edu/~tpederse/wordnet.html). I used a rather elderly
> machine with a 2.0 GHz Xeon processor, and ran Wu and Palmer (wup),
> Adapted Lesk (lesk) and Shortest Path (path) on those 10,000 pairs.
>
> path finished in 153 seconds, so about 65 pairs a second
> wup finished in 268 seconds, so about 37 pairs a second
> lesk finished in 434 seconds, so about 23 pairs a second
>
> Now this isn't blazing fast, but on the other hand it doesn't strike
> me as unreasonably slow either. What would be your target for pairs
> per second? That could help figure out an appropriate solution... Also, if
> you are finding that it takes considerably longer than the times I
> report above, please do let me know as there might be some other issue
> that is slowing you down.
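>
> In case it's useful to see the shape of that test, here is a rough
> sketch of the kind of driver loop I have in mind (not the actual
> randomPairs.pl harness, and the two pairs listed are just placeholders
> for the 10,000 random ones):
>
> #!/usr/bin/perl
> # Sketch of a timing loop: load WordNet once, then time each measure
> # over the same list of word#pos#sense pairs.
> use strict;
> use warnings;
> use Time::HiRes qw(gettimeofday tv_interval);
> use WordNet::QueryData;
> use WordNet::Similarity::path;
> use WordNet::Similarity::wup;
> use WordNet::Similarity::lesk;
>
> my $wn = WordNet::QueryData->new;    # pay the load cost once
>
> # placeholder pairs; the real test used 10,000 randomly generated ones
> my @pairs = ( ["car#n#1", "bus#n#2"], ["dog#n#1", "cat#n#1"] );
>
> for my $class (qw(WordNet::Similarity::path
>                   WordNet::Similarity::wup
>                   WordNet::Similarity::lesk)) {
>     my $measure = $class->new($wn);
>     my $t0 = [gettimeofday];
>     $measure->getRelatedness(@$_) for @pairs;
>     my $elapsed = tv_interval($t0);
>     printf "%s: %.1f pairs per second\n", $class, scalar(@pairs) / $elapsed;
> }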
>
> For those who are truly interested in WordNet::Similarity performance,
> here are a few more thoughts - I think about this issue regularly, so
> any ideas or alternative suggestions are very much appreciated.
>
> You are right in noting that the "load time" is the big factor with a
> small number of measurements, and that processing 1 pair is much, much
> more expensive than processing 100 or 1,000. Performance-related issues
> like this have been discussed a bit on the WordNet::Similarity mailing
> list, although it sounds like you already understand the basics pretty
> well. Here's a sampling of those discussions....
>
> http://www.mail-archive.com/wn-similarity@yahoogroups.com/msg00068.html
> http://www.mail-archive.com/wn-similarity@yahoogroups.com/msg00226.html
> http://www.mail-archive.com/wn-similarity@yahoogroups.com/msg00474.html
> http://www.mail-archive.com/wn-similarity@yahoogroups.com/msg00499.html
>
> There are a few things to consider - lesk is among the slower of the
> measures, because it does string overlap matching for each pair of
> extended glosses it constructs. These extended glosses can be rather
> big (hundreds of words), so finding the string overlaps is rather
> time consuming. Wu and Palmer (wup), on the other hand, is fairly
> swift in that it relies on depth information for each sense, and we
> compute those depths at run time and cache them (how deep each
> concept is gets found once and then re-used). So overall that's a
> fairly nice choice in terms of speed. As you can see above, path is
> even quicker, because it's only finding shortest paths between the
> senses.
>
> It's often suggested that we should pre-compute the similarity and
> relatedness values, and make them available as some sort of table or
> database. That would certainly speed things up, but it's also a
> rather enormous task: we are talking about roughly a 200,000 x
> 200,000 relatedness matrix for lesk, vector, vector-pairs and hso.
> Those measures can be computed between any parts of speech, so you
> need to compute all the pairwise relatedness values for everything in
> WordNet. Given that the relatedness measures are generally the
> slowest, it would be a time-consuming operation to build that matrix.
>
> The similarity measures, on the other hand, are slightly more
> tractable in that they are limited to noun-noun and verb-verb
> comparisons. That means we could pre-compute a 140,000 x 140,000
> matrix for the nouns, and a 25,000 x 25,000 matrix for the verbs. In
> fact doing the verbs isn't so bad, and if there's any interest I
> could surely run something to generate those matrices; the noun
> similarities and the relatedness measures would require a bit more
> patience. Note that the matrices are symmetric, so it's not *quite*
> as bad as it first appears.
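>
> If anyone wants to experiment with the verb case, the skeleton of
> such a pre-computation is simple enough to sketch. The code below is
> untested and only compares the first sense of each verb, just to show
> the shape of the loop; a real run would cover all senses and would
> still take a good while:
>
> #!/usr/bin/perl
> # Sketch: pre-compute wup similarity over verb-verb pairs and dump
> # them as "word1 word2 score" lines. The matrix is symmetric, so only
> # the upper triangle is computed.
> use strict;
> use warnings;
> use WordNet::QueryData;
> use WordNet::Similarity::wup;
>
> my $wn      = WordNet::QueryData->new;
> my $measure = WordNet::Similarity::wup->new($wn);
>
> my @verbs = $wn->listAllWords('v');    # every verb lemma in WordNet
>
> for my $i (0 .. $#verbs) {
>     for my $j ($i + 1 .. $#verbs) {
>         # sense 1 vs sense 1 only, to keep the sketch short
>         my $score = $measure->getRelatedness("$verbs[$i]#v#1",
>                                              "$verbs[$j]#v#1");
>         print "$verbs[$i] $verbs[$j] $score\n" if defined $score;
>     }
> }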
>
> I should also say that a number of people have said they were going
> to try to do this sort of pre-computation of values, but I can't
> recall whether anyone actually finished it and/or made it available.
> If so, that could be something to consider (and it would be nice for
> anyone who has done it to remind us, as it might be generally pretty
> useful).
>
> The downside to pre-computing is that many of the measures take
> options (all the relatedness measures and the information content
> measures, for sure), so a pre-computed set of values would only
> reflect one particular way of using those measures - one size may not
> fit all. For the path-based similarity measures (path, wup, lch) I
> don't think this is much of an issue, since there aren't many options
> and in general everyone uses those measures in the same way, so a
> pre-computed table of similarity values could be very helpful.
>
> Other suggestions have included: rewrite in Java, Python, C++, C,
> Fortran, anything but Perl!! :) I don't know if I see much advantage
> to that in terms of speed. Perl uses a lot of memory, and that can
> cause performance to lag in some situations, but in general
> WordNet::Similarity doesn't use much memory, so reducing memory use
> isn't much of a win (typically it runs in less than 200 MB of RAM,
> which doesn't seem onerous). That said, it is possible that lesk
> could be sped up considerably with a faster string matching algorithm
> (in Perl or C or something else), and it's also possible that there
> might be a more clever way of encoding WordNet so that the path
> searches it does in real time are much faster. So there are probably
> areas of the WordNet::Similarity code that could benefit from an
> efficient subroutine in some other language, and we are very open to
> including something like that in the package.
>
> It is possible that you could run many, many similarity/relatedness
> measurements in parallel and pick up some speed that way. Each
> measurement is independent of every other measurement, so it's a very
> decomposable problem (embarrassingly parallel, as some like to say),
> and if you have the ability to divide up the work and collect the
> results again, that could have some possibilities. You might even
> consider using our server mechanism (similarity_server.pl in the
> package) as a way to handle multiple parallel queries - you could run
> the server and have it spawn as many children as you care to specify,
> where each child computes a similarity pair.
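>
> To make the "divide up the work" part concrete, something along the
> following lines would farm a big list of pairs out to several worker
> processes on one machine. This is only a sketch: the input file name,
> the chunking, and the use of Parallel::ForkManager are choices I'm
> making up for the example, not anything shipped with the package.
>
> #!/usr/bin/perl
> # Sketch: split a file of word#pos#sense pairs into chunks and score
> # each chunk in a forked worker. Each worker loads WordNet itself, so
> # the load cost is paid once per worker rather than once per pair.
> use strict;
> use warnings;
> use List::Util qw(min);
> use Parallel::ForkManager;
> use WordNet::QueryData;
> use WordNet::Similarity::path;
>
> my $workers = 4;                          # tune to the number of cores
> my $pm = Parallel::ForkManager->new($workers);
>
> # pairs.txt (hypothetical): one "word1#pos#sense word2#pos#sense" per line
> open my $in, '<', 'pairs.txt' or die $!;
> chomp(my @pairs = <$in>);
> close $in;
>
> my $chunk = int(@pairs / $workers) + 1;
> for my $w (0 .. $workers - 1) {
>     $pm->start and next;                  # parent continues the loop
>
>     my $wn      = WordNet::QueryData->new;
>     my $measure = WordNet::Similarity::path->new($wn);
>     my $last    = min($#pairs, ($w + 1) * $chunk - 1);
>     open my $out, '>', "scores.$w" or die $!;
>     for my $line (@pairs[$w * $chunk .. $last]) {
>         my ($wps1, $wps2) = split ' ', $line;
>         my $score = $measure->getRelatedness($wps1, $wps2);
>         print $out "$wps1 $wps2 $score\n" if defined $score;
>     }
>     close $out;
>
>     $pm->finish;                          # child exits here
> }
> $pm->wait_all_children;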
>
> Well, I hope this helps a little, and perhaps gives folks ideas about
> what's happening. I do think it's important to calibrate things a
> bit: work out how many pairs a second you are able to get in your
> current setup, and then set a specific target for what you need. If
> you are getting 20 pairs a second and need 20,000, that's quite
> different from needing 100, and each expectation suggests different
> sorts of solutions.
>
> Cordially,
> Ted
>
> On Tue, Feb 1, 2011 at 3:25 AM, Suzan Verberne <s.verberne at let.ru.nl> wrote:
>> Hi all,
>>
>> I have previously used Pedersen's WordNet::Similarity module (
>> http://search.cpan.org/dist/WordNet-Similarity/lib/WordNet/Similarity.pm
>> ) for calculating the similarity or relatedness between pairs of
>> words. Now I have started to use it again, but I noticed that it is
>> way too slow for a real-time application (which is what I need now).
>>
>> I originally wrote a simple Perl script that calls the module (shown
>> below), but it takes almost five seconds to run. Almost all of this
>> time is spent initializing the module, so for batch scripts it is
>> fine (the module is then only initialized once for many requests),
>> but I need it to work in real time in a retrieval experiment, and
>> there 5 seconds is too long.
>>
>> Does anyone know of an alternative (fast!) tool for calculating
>> similarity and/or relatedness between two words? It could use either
>> a Wu & Palmer-like measure or a Lesk-type measure.
>>
>> Thanks!
>> Suzan Verberne
>>
>> #!/usr/bin/perl
>> use strict;
>> use warnings;
>> use WordNet::QueryData;
>> use WordNet::Similarity::path;
>> # loading WordNet and the measure is where almost all the time goes
>> my $wn      = WordNet::QueryData->new;
>> my $measure = WordNet::Similarity::path->new($wn);
>> my $value   = $measure->getRelatedness("car#n#1", "bus#n#2");
>> print "car (sense 1) <-> bus (sense 2) = $value\n";
>>
>>
>> --
>> Suzan Verberne, postdoctoral researcher
>> Centre for Language and Speech Technology
>> Radboud University Nijmegen
>> Tel: +31 24 3611134
>> Email: s.verberne at let.ru.nl
>> http://lands.let.ru.nl/~sverbern/
>> --
>>
>
>
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>



-- 
Linguist, Desi Linguist
http://www.desilinguist.org

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


