[Corpora-List] Faster tool for WordNet Similarity measures
Ted Pedersen
tpederse at d.umn.edu
Tue Feb 1 18:06:48 UTC 2011
Hi Suzan,
If you locate such an alternative I'm absolutely dying to know. :)
It would also be helpful to know how fast would be considered *fast*
or fast enough. I did a quick little benchmark using 10,000 randomly
generated noun pairs (using our randomPairs.pl program from
http://www.d.umn.edu/~tpederse/wordnet.html ). I used a rather elderly
machine with a 2.0 GHz Xeon processor, and ran Wu and Palmer (wup),
Adapted Lesk (lesk) and Shortest Path (path) on those 10,000 pairs.
path finished in 153 seconds, so about 65 pairs a second
wup finished in 268 seconds, so about 37 pairs a second
lesk finished in 434 seconds, so about 23 pairs a second
Now this isn't blazing fast, but on the other hand it doesn't strike
me as unreasonably slow either. What would be your target for pairs
per second? That could help figure out an appropriate solution... Also, if
you are finding that it takes considerably longer than the times I
report above, please do let me know as there might be some other issue
that is slowing you down.
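To make the numbers above easy to reproduce, here is a rough timing
sketch (assuming WordNet::QueryData and WordNet::Similarity are
installed; the @pairs list below is just a placeholder for whatever
pairs you want to test). The point is that the module is loaded once
and the cost is then amortized over all the pairs:

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);
use WordNet::QueryData;
use WordNet::Similarity::path;

# Pay the WordNet load cost once, up front.
my $wn      = WordNet::QueryData->new;
my $measure = WordNet::Similarity::path->new($wn);

# Placeholder pairs - in practice read these from a file, e.g. the
# output of randomPairs.pl.
my @pairs = ( [ "car#n#1", "bus#n#2" ], [ "dog#n#1", "cat#n#1" ] );

my $start = time;
for my $pair (@pairs) {
    my $value = $measure->getRelatedness( $pair->[0], $pair->[1] );
}
my $elapsed = time - $start;
printf "%d pairs in %.2f seconds (%.1f pairs/second)\n",
    scalar @pairs, $elapsed, @pairs / $elapsed;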
For those who are truly interested in WordNet::Similarity performance,
here are a few more thoughts - I think about this issue regularly, so
any ideas or alternative suggestions are very much appreciated.
You are right in noting that the "load time" is the big factor with a
small number of measurements, and that processing 1 pair is much much
more expensive than processing 100 or 1000. Performance-related issues
like this have been discussed a bit on the WordNet Similarity mailing
list, although it sounds like you already understand the basics pretty
well. Here's a sampling of those discussions....
http://www.mail-archive.com/wn-similarity@yahoogroups.com/msg00068.html
http://www.mail-archive.com/wn-similarity@yahoogroups.com/msg00226.html
http://www.mail-archive.com/wn-similarity@yahoogroups.com/msg00474.html
http://www.mail-archive.com/wn-similarity@yahoogroups.com/msg00499.html
There are a few things to consider - lesk is among the slower of the
measures, because it does string overlap matching for each pair of
extended glosses that are constructed. These extended glosses can be
rather big (hundreds of words) so finding the string overlaps is
rather time consuming. Wu and Palmer (wup), on the other hand, is
fairly swift in that it relies on depth information (for each sense)
and we pre-compute that at run time (the depth of each concept is
found once and then re-used). So overall that's a fairly nice choice
in terms of speed. As you can see above path is even quicker, because
it's only finding shortest paths between the senses.
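Just to illustrate the mechanics - all of the measures share the same
constructor and getRelatedness() interface, so switching between path,
wup and lesk is a one-line change, and the expensive WordNet::QueryData
object only needs to be built once (the word pair here is just an
example):

#!/usr/bin/perl
use strict;
use warnings;
use WordNet::QueryData;
use WordNet::Similarity::path;
use WordNet::Similarity::wup;
use WordNet::Similarity::lesk;

# Load WordNet once and hand the same object to every measure.
my $wn = WordNet::QueryData->new;
my %measures = (
    path => WordNet::Similarity::path->new($wn),
    wup  => WordNet::Similarity::wup->new($wn),
    lesk => WordNet::Similarity::lesk->new($wn),
);

for my $name ( sort keys %measures ) {
    my $value = $measures{$name}->getRelatedness( "car#n#1", "bus#n#2" );
    print "$name: $value\n";
}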
It's often suggested that we should pre-compute the similarity and
relatedness values, and make them available as some sort of table or
database. That would certainly speed things up, but it's also a rather
enormous task, in that we are talking about roughly a 200,000 x
200,000 relatedness matrix for lesk, vector, vector-pairs and hso.
Those measures can be computed between any parts of speech, so you
need to compute all the pairwise relatedness values for everything in
WordNet. Given that the relatedness measures are generally the slowest
it would be a time consuming operation to build that matrix.
The similarity measures on the other hand, are slightly more tractable
in that they are limited to doing noun-noun and verb-verb comparisons.
So that means we could pre-compute 140,000 x 140,000 matrices for the
nouns, and 25,000 x 25,000 for the verbs. In fact doing the verbs
isn't so bad, and if there's any interest I could surely run something
to generate those matrices. The noun similarities and the relatedness
measures themselves would require a bit more patience. Note that the
matrices are symmetric, so it's not *quite* as bad as it first
appears.
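If someone did want to try the pre-computation, a minimal sketch might
look like the following (the sense list and the output file name
path-noun-matrix.txt are hypothetical; for the real thing you would
iterate over every noun sense in WordNet). The symmetry is what lets
the inner loop start at $i rather than 0:

#!/usr/bin/perl
use strict;
use warnings;
use WordNet::QueryData;
use WordNet::Similarity::path;

my $wn      = WordNet::QueryData->new;
my $measure = WordNet::Similarity::path->new($wn);

# Hypothetical sense list - in reality, every noun sense in WordNet.
my @senses = ( "car#n#1", "bus#n#2", "train#n#1" );

open my $out, '>', 'path-noun-matrix.txt' or die $!;
for my $i ( 0 .. $#senses ) {
    # Symmetric matrix, so only the upper triangle is computed.
    for my $j ( $i .. $#senses ) {
        my $value = $measure->getRelatedness( $senses[$i], $senses[$j] );
        print {$out} "$senses[$i]\t$senses[$j]\t$value\n";
    }
}
close $out;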
I should also say that a number of people have said they were going to
try and do this sort of pre-computation of values, but I can't recall
if anyone actually finished that and/or made it available. If so, that
could be something to consider (and it might be nice for anyone who
has done that to remind us as it might be generally pretty useful).
The downside to pre-computing is that it seems to me there are many
options that can be used for some of the measures (all the relatedness
measures and the info content measures for sure) so a pre-computed set
of values would only reflect one particular way of using those
measures - so, one size may not fit all. For the path based similarity
measures (path, wup, lch) I don't think this is much of an issue as
there aren't so many options, and so in general everyone uses those in
the same way, and a pre-computed table of similarity values might be
very helpful.
Other suggestions have included - rewrite in Java, Python, C++, C,
Fortran, anything but Perl!! :) - I don't know if I see much advantage
to that in terms of speed - Perl uses a lot of memory and that can
cause performance to lag in some situations, but in general
WordNet::Similarity doesn't use much memory so reducing memory use
isn't much of a win (typically it's running with less than 200 MB of
RAM which doesn't seem onerous). That said, it is possible that lesk
could be sped up considerably with a faster string matching algorithm
(in Perl or C or something else), and it's also possible that there
might be a more clever way of encoding WordNet so that the path
searches done at run time are much faster. So there are probably
areas of the WordNet::Similarity code that could benefit from an
efficient sub-routine in some other language, and we are very open to
including something like that in the package.
It is possible that you could run many many similarity/relatedness
measurements in parallel and pick up some speed that way. Each
measurement is independent of every other measurement, so it's a very
decomposable problem (embarrassingly parallel as some like to say) so
if you have the ability to divide up the work and collect the results
again that could have some possibilities. You might even consider
using our server mechanism (similarity_server.pl in the package) as a
way to handle multiple parallel queries - you could run the server and
then it can run as many children as you care to specify, where each
child computes the similarity for one pair.
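As an alternative to the server, here is a rough fork-based sketch
using Parallel::ForkManager from CPAN (an extra dependency, not part
of WordNet::Similarity; the file names pairs.txt and results.N.txt are
just placeholders). Each child loads WordNet itself and works through
its own slice of the pairs:

#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;
use WordNet::QueryData;
use WordNet::Similarity::path;

my $workers = 4;

# Read "sense1 sense2" pairs, one pair per line.
my @pairs;
open my $in, '<', 'pairs.txt' or die $!;
while (<$in>) { chomp; push @pairs, [ split ]; }
close $in;

my $pm = Parallel::ForkManager->new($workers);
for my $w ( 0 .. $workers - 1 ) {
    $pm->start and next;    # parent keeps looping, child falls through

    # Each child builds its own WordNet objects, takes every
    # $workers-th pair, and writes results to its own file.
    my $wn      = WordNet::QueryData->new;
    my $measure = WordNet::Similarity::path->new($wn);
    open my $out, '>', "results.$w.txt" or die $!;
    for ( my $i = $w ; $i < @pairs ; $i += $workers ) {
        my ( $s1, $s2 ) = @{ $pairs[$i] };
        print {$out} "$s1\t$s2\t", $measure->getRelatedness( $s1, $s2 ), "\n";
    }
    close $out;
    $pm->finish;
}
$pm->wait_all_children;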
Well, I hope this helps a little, and perhaps gives folks ideas about
what's happening. I do think it's important to calibrate things a bit,
and work out how many pairs a second you are able to get in your
current setup, and then set a specific target for what you need. If
you are getting 20 pairs a second and need 20,000, that's quite
different than if you need 100, and each expectation suggests
different sorts of solutions.
Cordially,
Ted
On Tue, Feb 1, 2011 at 3:25 AM, Suzan Verberne <s.verberne at let.ru.nl> wrote:
> Hi all,
>
> I have previously been using Pedersen's WordNet Similarity module (
> http://search.cpan.org/dist/WordNet-Similarity/lib/WordNet/Similarity.pm
> ) for calculating the similarity or relatedness between pairs of
> words. Now I started to use it again but I noticed that it is way too
> slow for a real-time application (which is what I need now).
>
> I originally wrote a simple Perl script that calls the module (shown
> below) but it takes almost five seconds to run. Almost all this time
> is spent on calling the module so for batch scripts it is fine (then
> the module is only called once for multiple requests), but I need it
> to work in real time in a retrieval experiment and then 5 seconds is
> too long.
>
> Does anyone know an alternative (fast!) tool for calculating
> Similarity and/or Relatedness between two words? It might be using
> either a Wu & Palmer-like measure or a Lesk-type measure.
>
> Thanks!
> Suzan Verberne
>
> #! /usr/bin/perl
> use WordNet::QueryData;
> use WordNet::Similarity::path;
> my $wn = WordNet::QueryData->new;
> my $measure = WordNet::Similarity::path->new ($wn);
> my $value = $measure->getRelatedness("car#n#1", "bus#n#2");
> print "car (sense 1) <-> bus (sense 2) = $value\n";
>
>
> --
> Suzan Verberne, postdoctoral researcher
> Centre for Language and Speech Technology
> Radboud University Nijmegen
> Tel: +31 24 3611134
> Email: s.verberne at let.ru.nl
> http://lands.let.ru.nl/~sverbern/
> --
>
--
Ted Pedersen
http://www.d.umn.edu/~tpederse
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora