[Corpora-List] software for measuring semantic similarity and relatedness?

Ted Pedersen tpederse at d.umn.edu
Tue Oct 29 18:54:58 UTC 2013


Thanks to all who responded to my request for information about freely
available packages to compute semantic similarity and relatedness
using some sort of ontology or structured resource.

Below is my best attempt at a summary - I have tried to be accurate
here, but if I've erred in how something is described (or
have messed up a URL), please do let me know. And of course, if there
are additions that should be made to this list, I'd be more than happy
to learn of those and include them both in this list and in the tutorial
that motivated my original request. And my sincere apologies if
someone sent me something that isn't included here - as long as there
was an implementation that could be downloaded or accessed via the
web, I intended to include that here (so please don't hesitate to
remind me).

I've divided the responses up into three categories.

1) packages that provide a variety of measures (and normally include
multiple measures that were developed by someone else, and then
implemented by the package authors perhaps along with a few of their
own measures)

2) implementations of specific measures

3) gold standard human similarity and relatedness judgements

Note that 3) wasn't included in my original request, but came about as
a result of asking about the first two, so I thought I would include
that information as well.

================================================
Systems that provide a variety of measures :
================================================

Based on WordNet and include measures based on path length, depth,
information content, and may include relatedness measures like lesk,
vector, hso

1) WordNet::Similarity http://wn-similarity.sourceforge.net

2) NLTK http://nltk.org

3) ws4j https://code.google.com/p/ws4j/

4) DKPro https://code.google.com/p/dkpro-similarity-asl/ (also
includes support for Wikipedia/Wikirelate, Wiktionary, openThesaurus,
GermaNet)
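
As a rough illustration of the path length idea mentioned above, here
is a minimal Python sketch. It does not use any of the listed packages:
the taxonomy (TOY_HYPERNYMS) and all function names are made up for
this example, and the 1 / (edges + 1) scoring is one common
formulation of a path measure, not necessarily what every package
computes.

```python
# Toy is-a taxonomy (hypothetical, not taken from WordNet): child -> parent.
TOY_HYPERNYMS = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
    "animal": None,
}


def path_to_root(concept, hypernyms):
    """Return the concepts from `concept` up to the root, inclusive."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = hypernyms[concept]
    return path


def shortest_path_edges(a, b, hypernyms):
    """Count the edges on the shortest is-a path between a and b."""
    steps_from_a = {c: i for i, c in enumerate(path_to_root(a, hypernyms))}
    for j, c in enumerate(path_to_root(b, hypernyms)):
        if c in steps_from_a:  # first shared ancestor is the lowest one
            return steps_from_a[c] + j
    raise ValueError("concepts share no ancestor")


def path_similarity(a, b, hypernyms=TOY_HYPERNYMS):
    """Inverse path length, counting nodes: 1 / (edges + 1)."""
    return 1.0 / (shortest_path_edges(a, b, hypernyms) + 1)
```

In this toy hierarchy, path_similarity("dog", "wolf") is higher than
path_similarity("dog", "cat") because dog and wolf share a closer
ancestor. Depth and information content based measures refine this
idea by also taking into account how specific the shared ancestor is.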

Based on various medical ontologies

1) UMLS::Similarity http://umls-similarity.sourceforge.net (based on
Unified Medical Language System)

2) Proteinon http://lasige.di.fc.ul.pt/webtools/proteinon/ (based on
Gene Ontology)

Systems where the focus may be on other issues but that still include
some support of semantic similarity and relatedness measures between
words/concepts

1) Disco http://www.linguatools.de/disco/disco_en.html (co-occurrence
/ corpus based similarity, but also includes plug-in for ontologies in
Protege)

2) Semilar http://semanticsimilarity.org/ (text to text similarity but
also includes support for word to word similarity)

=================================================
Implementations of Specific measures :
=================================================

1) UKB http://ixa2.si.ehu.es/ukb/ (graph based similarity and
relatedness, using WordNet)

2) http://www.cs.columbia.edu/~weiwei/code.html#wmfvec (high
dimensional approach using definitions from WordNet/Wiktionary)

3) http://olesk.com/#SemanticRelatedness (shortest path in weighted
semantic network)

==============================================================================
Gold Standard data sets with human similarity and relatedness judgements :
==============================================================================

1) Yang and Powers 2006 Verb Similarity Scores (130 pairs)

http://david.wardpowers.info/Research/AI/papers/200601-GWC-VerbSimWN.pdf
http://david.wardpowers.info/Research/AI/papers/200601-GWC-130verbpairs.txt

2) WordSimilarity 353 Test Collection

http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/

http://alfonseca.org/eng/research/wordsim353.html (divided into
similarity and relatedness pairs)

3) Rubenstein and Goodenough (65 pairs) and Miller and Charles (30 pair
subset of RG)

http://www.d.umn.edu/~tpederse/Data/rubenstein-goodenough-1965.txt

http://www.d.umn.edu/~tpederse/Data/miller-charles-1991.txt

4) ConceptSim (sense annotated versions of MC, RG, and WordSim 353)

http://www.seas.upenn.edu/~hansens/conceptSim/

5) Medical concepts from UMLS

http://rxinformatics.umn.edu/SemanticRelatednessResources.html

Four different data sets, one with 101 pairs, another made up of a
subset of 30 of those (both rated for relatedness), another with 566
pairs rated for similarity, and another with 587 pairs rated for
relatedness.
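
Data sets like these are commonly used by rank-correlating a measure's
scores with the human ratings over the same pairs. The sketch below
shows one way to do that with Spearman's rank correlation in plain
Python. The function names are hypothetical, no file format from the
sites above is assumed, and the choice of Spearman (rather than some
other statistic) is an assumption of this example.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(human_ratings, system_scores):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(human_ratings) != len(system_scores):
        raise ValueError("need one system score per human rating")
    return pearson(average_ranks(human_ratings), average_ranks(system_scores))
```

For example, with invented numbers (not drawn from any of the data
sets above), spearman([1, 2, 3, 4], [1, 3, 2, 4]) gives 0.8, since the
two orderings disagree only on the middle pair.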

========================================================================

So, that's what I have at this point. Additional contributions,
clarifications, etc. are certainly welcomed!

Cordially,
Ted

On Sun, Oct 6, 2013 at 10:50 AM, Ted Pedersen <tpederse at d.umn.edu> wrote:
> Well I managed to misspell my own URL :)
>
> WordNet::Similarity
> http://wn-similarity.sourceforge.net
>
> All the others appear to be correct.
>
> On Sun, Oct 6, 2013 at 10:45 AM, Ted Pedersen <tpederse at d.umn.edu> wrote:
>> Greetings all,
>>
>> I'm preparing a tutorial on measuring semantic similarity and
>> relatedness between concepts. My particular focus is on methods that
>> do this using ontologies or other (at least somewhat) structured
>> resources (like Wikipedia, folksonomies, etc.) and that also have
>> freely available software associated with them (or at least a web
>> demo).
>>
>> While it's a very interesting area, this particular tutorial won't
>> include purely distributional approaches (due to time constraints), so
>> I'm looking for methods and software that use some sort of resource
>> like WordNet, Wikipedia, medical ontologies, Freebase, etc. to arrive
>> at measurements of semantic similarity or relatedness between pairs of
>> concepts.
>>
>> What follows is my current list, based on projects I have not only
>> heard of but have used in the not too distant past - so I guess I'm
>> particularly interested in projects you have used or created yourself
>> (and can therefore vouch for to some extent).
>>
>> Based on WordNet, provide path, depth, info content based measures,
>> may include relatedness measures like lesk, vector, hso
>>
>> WordNet::Similarity
>> http://wn-similarity.sourcforge.net
>>
>> NLTK
>> http://nltk.org
>>
>> ws4j
>> https://code.google.com/p/ws4j/
>>
>> Based on UMLS (Unified Medical Language System), provide path, depth,
>> info content measures, includes relatedness measures lesk, vector
>>
>> UMLS::Similarity
>> http://umls-similarity.sourceforge.net
>>
>> Based on Gene Ontology (GO), provide path, depth, and info content measures
>>
>> Proteinon
>> http://lasige.di.fc.ul.pt/webtools/proteinon/
>>
>> I will post a summary of whatever I hear about after some period of
>> time. Any hints or suggestions will be very gratefully received.
>>
>> Many thanks,
>> Ted
>>
>> --
>> Ted Pedersen
>> http://www.d.umn.edu/~tpederse
>
>
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
