[Corpora-List] software for measuring semantic similarity and relatedness?

Ted Pedersen tpederse at d.umn.edu
Sun Nov 17 22:51:39 UTC 2013


Well, the site I gave really focuses more on results (with links to
data). The following page seems like an intended home for data,
although it's rather broad in scope...

http://aclweb.org/aclwiki/index.php?title=Knowledge_collections_and_datasets_(English)

So, maybe the best course would be to reorganize this datasets page a
bit (perhaps separating the datasets based on the kind of data), and
then link from other pages when that data is used...? I guess someone
needs to "be bold" (as the wiki folks like to say) and just do
something. I'll see how bold I'm feeling when I get to that point. :)

Cordially,
Ted

On Sun, Nov 17, 2013 at 4:45 PM, Ted Pedersen <tpederse at d.umn.edu> wrote:
> Hi Francis,
>
> Sorry for the delay in reply, but yes, I agree. I think we should work
> on getting a somewhat centralized list for these sorts of data
> sources. I think the following is a good possible home, and I'll do
> some updating there in the near future. And of course, if anyone else
> would like to do the same, that would be even better!
>
> http://aclweb.org/aclwiki/index.php?title=Similarity_(State_of_the_art)
>
> Cordially,
> Ted
>
> On Tue, Oct 29, 2013 at 9:06 PM, Francis Bond <bond at ieee.org> wrote:
>> G'day,
>>
>> I think these summaries are great!  Have you considered adding them to
>> the aclwiki?  I think it is a good place to make these widely
>> available (I try to keep the Japanese corpora page up-to-date).
>>
>> On Wed, Oct 30, 2013 at 2:39 AM, manaal faruqui <manaalfar at gmail.com> wrote:
>>> I have recently assembled (under construction) a list of all the available
>>> lexical semantic evaluation benchmarks that people have been using in their
>>> research. Hope people will find it useful!
>>>
>>> http://www.cs.cmu.edu/~mfaruqui/suite.html
>>>
>>> Manaal
>>>
>>>
>>> On Tue, Oct 29, 2013 at 2:54 PM, Ted Pedersen <tpederse at d.umn.edu> wrote:
>>>>
>>>> Thanks to all who responded to my request for information about freely
>>>> available packages to compute semantic similarity and relatedness
>>>> using some sort of ontology or structured resource.
>>>>
>>>> Below is my best attempt at a summary - I have tried to be accurate
>>>> here, but if I've erred in how something is described (or have
>>>> messed up a URL), please do let me know. And of course, if there
>>>> are additions that should be made to this list, I'd be more than happy
>>>> to learn of those and include them both in this list and in the tutorial
>>>> that motivated my original request. And my sincere apologies if
>>>> someone sent me something that isn't included here - as long as there
>>>> was an implementation that could be downloaded or accessed via the
>>>> web, I intended to include that here (so please don't hesitate to
>>>> remind me).
>>>>
>>>> I've divided the responses up into three categories.
>>>>
>>>> 1) packages that provide a variety of measures (and normally include
>>>> multiple measures that were developed by someone else, and then
>>>> implemented by the package authors perhaps along with a few of their
>>>> own measures)
>>>>
>>>> 2) implementations of specific measures
>>>>
>>>> 3) gold standard human similarity and relatedness judgements
>>>>
>>>> Note that 3) wasn't included in my original request, but came about as
>>>> a result of asking about the first two, so I thought I would include
>>>> that information as well.
>>>>
>>>> ================================================
>>>> Systems that provide a variety of measures :
>>>> ================================================
>>>>
>>>> Based on WordNet and include measures based on path length, depth,
>>>> information content, and may include relatedness measures like lesk,
>>>> vector, hso
>>>>
>>>> 1) WordNet::Similarity http://wn-similarity.sourceforge.net
>>>>
>>>> 2) NLTK http://nltk.org
>>>>
>>>> 3) ws4j https://code.google.com/p/ws4j/
>>>>
>>>> 4) DKPro https://code.google.com/p/dkpro-similarity-asl/ (also
>>>> includes support for Wikipedia/Wikirelate, Wiktionary, openThesaurus,
>>>> GermaNet)
>>>>
>>>> Based on various medical ontologies
>>>>
>>>> 1) UMLS::Similarity http://umls-similarity.sourceforge.net (based on
>>>> Unified Medical Language System)
>>>>
>>>> 2) Proteinon http://lasige.di.fc.ul.pt/webtools/proteinon/ (based on
>>>> Gene Ontology)
>>>>
>>>> Systems where the focus may be on other issues but that still include
>>>> some support of semantic similarity and relatedness measures between
>>>> words/concepts
>>>>
>>>> 1) Disco http://www.linguatools.de/disco/disco_en.html (co-occurrence
>>>> / corpus based similarity, but also includes plug-in for ontologies in
>>>> Protege)
>>>>
>>>> 2) Semilar http://semanticsimilarity.org/ (text to text similarity but
>>>> also includes support for word to word similarity)
>>>>
>>>> =================================================
>>>> Implementations of Specific measures :
>>>> =================================================
>>>>
>>>> 1) UKB http://ixa2.si.ehu.es/ukb/ (graph based similarity and
>>>> relatedness, using WordNet)
>>>>
>>>> 2) http://www.cs.columbia.edu/~weiwei/code.html#wmfvec (high
>>>> dimensional approach using definitions from WordNet/Wiktionary)
>>>>
>>>> 3) http://olesk.com/#SemanticRelatedness (shortest path in weighted
>>>> semantic network)
>>>>
>>>>
>>>> ==============================================================================
>>>> Gold Standard data sets with human similarity and relatedness judgements :
>>>> ==============================================================================
>>>>
>>>> 1) Yang and Powers 2006 Verb Similarity Scores (130 pairs)
>>>>
>>>> http://david.wardpowers.info/Research/AI/papers/200601-GWC-VerbSimWN.pdf
>>>>
>>>> http://david.wardpowers.info/Research/AI/papers/200601-GWC-130verbpairs.txt
>>>>
>>>> 2) WordSimilarity 353 Test Collection
>>>>
>>>> http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/
>>>>
>>>> http://alfonseca.org/eng/research/wordsim353.html (divided into
>>>> similarity and relatedness pairs)
>>>>
>>>> 3) Rubenstein and Goodenough (65 pairs) and Miller and Charles
>>>> (30-pair subset of RG)
>>>>
>>>> http://www.d.umn.edu/~tpederse/Data/rubenstein-goodenough-1965.txt
>>>>
>>>> http://www.d.umn.edu/~tpederse/Data/miller-charles-1991.txt
>>>>
>>>> 4) ConceptSim (sense annotated versions of MC, RG, and WordSim 353)
>>>>
>>>> http://www.seas.upenn.edu/~hansens/conceptSim/
>>>>
>>>> 5) Medical concepts from UMLS
>>>>
>>>> http://rxinformatics.umn.edu/SemanticRelatednessResources.html
>>>>
>>>> Four different data sets, one with 101 pairs, another made up of a
>>>> subset of 30 of those (both rated for relatedness), another with 566
>>>> pairs rated for similarity, and another with 587 pairs rated for
>>>> relatedness.
>>>>
>>>> ========================================================================
>>>>
>>>> So, that's what I have at this point. Additional contributions,
>>>> clarifications, etc. are certainly welcomed!
>>>>
>>>> Cordially,
>>>> Ted
>>>>
>>>> On Sun, Oct 6, 2013 at 10:50 AM, Ted Pedersen <tpederse at d.umn.edu> wrote:
>>>> > Well I managed to misspell my own URL :)
>>>> >
>>>> > WordNet::Similarity
>>>> > http://wn-similarity.sourceforge.net
>>>> >
>>>> > All the others appear to be correct.
>>>> >
>>>> > On Sun, Oct 6, 2013 at 10:45 AM, Ted Pedersen <tpederse at d.umn.edu>
>>>> > wrote:
>>>> >> Greetings all,
>>>> >>
>>>> >> I'm preparing a tutorial on measuring semantic similarity and
>>>> >> relatedness between concepts. My particular focus is on methods that
>>>> >> do this using ontologies or other (at least somewhat) structured
>>>> >> resources (like Wikipedia, folksonomies, etc.) and that also have
>>>> >> freely available software associated with them (or at least a web
>>>> >> demo).
>>>> >>
>>>> >> While it's a very interesting area, this particular tutorial won't
>>>> >> include purely distributional approaches (due to time constraints), so
>>>> >> I'm looking for methods and software that use some sort of resource
>>>> >> like WordNet, Wikipedia, medical ontologies, Freebase, etc. to arrive
>>>> >> at measurements of semantic similarity or relatedness between pairs of
>>>> >> concepts.
>>>> >>
>>>> >> What follows is my current list, based on projects I have not only
>>>> >> heard of but have also used in the not too distant past - so I guess I'm
>>>> >> particularly interested in projects you have used or created yourself
>>>> >> (and can therefore vouch for to some extent).
>>>> >>
>>>> >> Based on WordNet, provide path, depth, info content based measures,
>>>> >> may include relatedness measures like lesk, vector, hso
>>>> >>
>>>> >> WordNet::Similarity
>>>> >> http://wn-similarity.sourcforge.net
>>>> >>
>>>> >> NLTK
>>>> >> http://nltk.org
>>>> >>
>>>> >> ws4j
>>>> >> https://code.google.com/p/ws4j/
>>>> >>
>>>> >> Based on UMLS (Unified Medical Language System), provide path, depth,
>>>> >> info content measures, includes relatedness measures lesk, vector
>>>> >>
>>>> >> UMLS::Similarity
>>>> >> http://umls-similarity.sourceforge.net
>>>> >>
>>>> >> Based on Gene Ontology (GO), provide path, depth, and info content measures
>>>> >>
>>>> >> Proteinon
>>>> >> http://lasige.di.fc.ul.pt/webtools/proteinon/
>>>> >>
>>>> >> I will post a summary of whatever I hear about after some period of
>>>> >> time. Any hints or suggestions will be very gratefully received.
>>>> >>
>>>> >> Many thanks,
>>>> >> Ted
>>>> >>
>>>> >> --
>>>> >> Ted Pedersen
>>>> >> http://www.d.umn.edu/~tpederse
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Ted Pedersen
>>>> > http://www.d.umn.edu/~tpederse
>>>>
>>>>
>>>>
>>>> --
>>>> Ted Pedersen
>>>> http://www.d.umn.edu/~tpederse
>>>>
>>>> _______________________________________________
>>>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>>> Corpora mailing list
>>>> Corpora at uib.no
>>>> http://mailman.uib.no/listinfo/corpora
>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
>> Division of Linguistics and Multilingual Studies
>> Nanyang Technological University
>
>
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
