[Corpora-List] software for measuring semantic similarity and relatedness?

Djamé Seddah djame.seddah at free.fr
Mon Oct 7 14:54:59 UTC 2013


Hi,
to (try to) see how similar some corpora were (news vs. Twitter, news vs. Facebook, etc.),
we used a variant of the Kullback–Leibler divergence that compares the distributions of character trigrams.
Actually, it worked surprisingly well.

See our 2012 paper
"The French Social Media Bank: a Treebank of Noisy User Generated Content"
(Djamé Seddah, Benoît Sagot, Marie Candito, Virginie Mouilleron, Vanessa Combet -- COLING 2012)

http://aclweb.org/anthology/C/C12/C12-1149.pdf
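
For anyone who wants to try this kind of comparison, here is a minimal Python sketch of a Kullback–Leibler divergence over character trigram distributions. It is not the code from the paper: the add-alpha smoothing, the function names and the two sample strings are assumptions made only for illustration, and the exact variant used in the paper may differ.

```python
import math
from collections import Counter

def char_trigrams(text):
    """Count overlapping character trigrams in a string."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def kl_divergence(p_counts, q_counts, alpha=1.0):
    """KL(P || Q) in bits over the union of observed trigrams.

    Add-alpha smoothing (an assumption of this sketch) keeps every
    probability nonzero so the logarithm is always defined.
    """
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    divergence = 0.0
    for gram in vocab:
        p = (p_counts[gram] + alpha) / p_total
        q = (q_counts[gram] + alpha) / q_total
        divergence += p * math.log2(p / q)
    return divergence

# Made-up sample texts standing in for two corpora.
news = "the minister announced a new budget for the coming year"
tweet = "omg lol the new budget is sooo bad!!! cant even"

news_grams = char_trigrams(news)
tweet_grams = char_trigrams(tweet)
print(kl_divergence(news_grams, news_grams))   # identical distributions
print(kl_divergence(news_grams, tweet_grams))  # larger, since the texts differ
```

Note that KL divergence is asymmetric, so swapping the two corpora generally gives a different value.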



Best,
Djamé 


On 7 Oct 2013, at 16:27, Juan Fernández Fernández wrote:

> 
> Hello,
> 
> I am also interested in the topic, but in a simpler way. I would like to measure word-based (not sense-based) similarity, that is, to find sentences that share the same words (lemmas), excluding stopwords. As I need to preprocess Twitter sentiment corpora, I was wondering if there are tools to detect this kind of word overlap, as found in spam or repetitive Twitter messages. Does anybody know of anything for Spanish?
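
The kind of word-overlap check described here can be sketched in a few lines of Python using Jaccard overlap between sets of content words. This is only an illustration, not an existing tool: the stopword list is a tiny hypothetical sample, no lemmatization is performed (tokens are simply lowercased), and the threshold value and sample messages are made up.

```python
import re

# Tiny illustrative stopword list (hypothetical; use a full Spanish list in practice).
STOPWORDS = {"el", "la", "los", "las", "de", "en", "y", "a",
             "que", "es", "un", "una", "me", "muy"}

def content_words(text):
    """Lowercase, tokenize on runs of letters, and drop stopwords (no lemmatization)."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}

def jaccard(a, b):
    """Jaccard overlap of two token sets: |A & B| / |A | B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def near_duplicates(messages, threshold=0.8):
    """Return index pairs whose content-word overlap reaches the threshold."""
    bags = [content_words(m) for m in messages]
    return [(i, j)
            for i in range(len(bags))
            for j in range(i + 1, len(bags))
            if jaccard(bags[i], bags[j]) >= threshold]

# Made-up example messages.
tweets = [
    "Me encanta el nuevo disco de la banda",
    "me encanta el nuevo disco de la banda!!!",
    "El partido de hoy es muy aburrido",
]
print(near_duplicates(tweets))
```

The pairwise loop is quadratic in the number of messages, which is fine for small samples; a large corpus would need some form of indexing or hashing instead.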
> 
> Thank you very much,
> 
> Juan F.
> 
> 
> 
> On 07/10/2013, at 15:44, Eneko Agirre wrote:
> 
>> 
>> 
>> Hi Ted and all,
>> 
>> you might want to check http://ixa2.si.ehu.es/ukb/, a graph-based algorithm for WSD and similarity, which uses random walks. It scores very high on RG65 and WordSim353 when run on WordNet, and can be applied to any KB.
>> 
>> It's open source and includes all data necessary to replicate the results reported in the following:
>> 
>> [3] Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca and Aitor Soroa. 2009. A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. Proceedings of NAACL-HLT 09. Boulder, USA.
>> 
>> [4] Eneko Agirre, Montse Cuadros, German Rigau and Aitor Soroa. 2010. Exploring Knowledge Bases for Similarity. Proceedings of LREC 2010. Valletta, Malta.
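
To give a rough idea of how random walks over a graph can yield a similarity score, here is a small Python sketch: it runs a personalized PageRank from each of two concepts and compares the resulting distributions with cosine similarity. This is not UKB's code or API. The toy graph, the function names, the damping factor and the iteration count are all assumptions chosen for illustration.

```python
import math

# Toy undirected concept graph (hypothetical; stands in for a KB such as WordNet).
EDGES = [
    ("car", "vehicle"), ("truck", "vehicle"), ("vehicle", "artifact"),
    ("banana", "fruit"), ("apple", "fruit"), ("fruit", "food"),
    ("artifact", "entity"), ("food", "entity"),
]

def build_graph(edges):
    """Adjacency sets for an undirected graph."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph

def personalized_pagerank(graph, seed, damping=0.85, iterations=50):
    """Power iteration; the walk restarts at `seed` with probability 1 - damping."""
    rank = {node: 0.0 for node in graph}
    rank[seed] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for node, mass in rank.items():
            share = damping * mass / len(graph[node])
            for neighbour in graph[node]:
                nxt[neighbour] += share
        nxt[seed] += 1.0 - damping
        rank = nxt
    return rank

def cosine(p, q):
    """Cosine similarity of two distributions defined over the same nodes."""
    dot = sum(p[n] * q[n] for n in p)
    norm_p = math.sqrt(sum(x * x for x in p.values()))
    norm_q = math.sqrt(sum(x * x for x in q.values()))
    return dot / (norm_p * norm_q)

graph = build_graph(EDGES)

def similarity(a, b):
    return cosine(personalized_pagerank(graph, a), personalized_pagerank(graph, b))

print(similarity("car", "truck"))   # nearby concepts
print(similarity("car", "banana"))  # distant concepts
```

On this toy graph, concepts that share a close neighbour end up with overlapping walk distributions and hence a higher score than concepts connected only through the root.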
>> 
>> best
>> 
>> eneko
>> 
>> 
>> 
>> On 10/06/2013 05:45 PM, Ted Pedersen wrote:
>>> Greetings all,
>>> 
>>> I'm preparing a tutorial on measuring semantic similarity and
>>> relatedness between concepts. My particular focus is on methods that
>>> do this using ontologies or other (at least somewhat) structured
>>> resources (like Wikipedia, folksonomies, etc.) and that also have
>>> freely available software associated with them (or at least a web
>>> demo).
>>> 
>>> While it's a very interesting area, this particular tutorial won't
>>> include purely distributional approaches (due to time constraints), so
>>> I'm looking for methods and software that use some sort of resource
>>> like WordNet, Wikipedia, medical ontologies, Freebase, etc. to arrive
>>> at measurements of semantic similarity or relatedness between pairs of
>>> concepts.
>>> 
>>> What follows is my current list, based on projects I have not only
>>> heard of but have also used in the not too distant past - so I guess I'm
>>> particularly interested in projects you have used or created yourself
>>> (and can therefore vouch for to some extent).
>>> 
>>> Based on WordNet, provide path, depth, info content based measures,
>>> may include relatedness measures like lesk, vector, hso
>>> 
>>> WordNet::Similarity
>>> http://wn-similarity.sourceforge.net
>>> 
>>> NLTK
>>> http://nltk.org
>>> 
>>> ws4j
>>> https://code.google.com/p/ws4j/
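
As a rough illustration of what path-based and information-content-based measures compute, here is a self-contained Python sketch on a toy is-a hierarchy. It does not use or reproduce the API of any of the packages listed above. The taxonomy, the corpus counts and the function names are invented for the example, and the formulas shown are the common textbook forms: path similarity as 1 / (1 + path length), and Resnik-style similarity as the information content of the lowest common subsumer.

```python
import math

# Toy is-a hierarchy, child -> parent (hypothetical; not real WordNet data).
PARENT = {
    "dog": "canine", "wolf": "canine", "canine": "mammal",
    "cat": "feline", "feline": "mammal", "mammal": "animal",
    "sparrow": "bird", "bird": "animal",
}
# Made-up corpus counts for leaf concepts, used only for information content.
COUNTS = {"dog": 40, "wolf": 5, "cat": 35, "sparrow": 20}

def ancestors(concept):
    """The concept followed by its ancestors up to the root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lowest_common_subsumer(a, b):
    """Deepest concept that is an ancestor of (or equal to) both a and b."""
    b_chain = set(ancestors(b))
    return next(c for c in ancestors(a) if c in b_chain)

def path_similarity(a, b):
    """1 / (1 + number of edges on the shortest is-a path between a and b)."""
    lcs = lowest_common_subsumer(a, b)
    edges = ancestors(a).index(lcs) + ancestors(b).index(lcs)
    return 1.0 / (1.0 + edges)

def information_content(concept):
    """-log p(concept), where a concept's count includes all its descendants."""
    total = sum(COUNTS.values())
    freq = sum(n for leaf, n in COUNTS.items() if concept in ancestors(leaf))
    return -math.log(freq / total)

def resnik_similarity(a, b):
    """Information content of the lowest common subsumer."""
    return information_content(lowest_common_subsumer(a, b))

print(path_similarity("dog", "wolf"))     # 2 edges apart
print(path_similarity("dog", "sparrow"))  # 5 edges apart
print(resnik_similarity("dog", "wolf"))   # subsumer: canine
print(resnik_similarity("dog", "cat"))    # subsumer: mammal
```

The path measure depends only on the shape of the hierarchy, while the information-content measure also depends on the corpus counts, so the two can rank pairs differently.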
>>> 
>>> Based on UMLS (Unified Medical Language System), provide path, depth,
>>> info content measures, includes relatedness measures lesk, vector
>>> 
>>> UMLS::Similarity
>>> http://umls-similarity.sourceforge.net
>>> 
>>> Based on the Gene Ontology (GO), provide path, depth, and info content measures
>>> 
>>> ProteInOn
>>> http://lasige.di.fc.ul.pt/webtools/proteinon/
>>> 
>>> I will post a summary of whatever I hear about after some period of
>>> time. Any hints or suggestions will be very gratefully received.
>>> 
>>> Many thanks,
>>> Ted
>>> 
>> 
>> 
>> -- 
>> 
>> Eneko Agirre
>> Euskal Herriko Unibertsitatea
>> University of the Basque Country
>> http://ixa2.si.ehu.es/eneko
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
> 

