[Corpora-List] software for measuring semantic similarity and relatedness?
Djamé Seddah
djame.seddah at free.fr
Mon Oct 7 14:54:59 UTC 2013
Hi,
to (try to) see how similar some corpora were (news vs. Twitter, news vs. Facebook, etc.),
we used a variant of the Kullback–Leibler divergence that compares the distributions of character trigrams.
Actually, it worked surprisingly well.
See our 2012 paper
"The French Social Media Bank: a Treebank of Noisy User Generated Content"
(Djamé Seddah, Benoît Sagot, Marie Candito, Virginie Mouilleron, Vanessa Combet -- COLING 2012)
http://aclweb.org/anthology/C/C12/C12-1149.pdf
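
For readers who want to try the idea, here is a minimal sketch of a Kullback–Leibler divergence over character trigram distributions. It is not the exact variant used in the paper: the add-alpha smoothing, the function names and the toy sentences are all assumptions made for illustration.

```python
import math
from collections import Counter

def char_trigrams(text):
    """Count overlapping character trigrams in a string."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def kl_divergence(p_counts, q_counts, alpha=1.0):
    """KL(P || Q) in bits over the union of observed trigrams.

    Add-alpha smoothing (an assumption of this sketch) keeps every
    probability non-zero, so the logarithm is always defined.
    """
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    divergence = 0.0
    for gram in vocab:
        p = (p_counts[gram] + alpha) / p_total
        q = (q_counts[gram] + alpha) / q_total
        divergence += p * math.log2(p / q)
    return divergence

# Toy "corpora" (hypothetical examples, far too small for real use).
news = "the government announced new measures on monday"
more_news = "the minister announced new measures on tuesday"
tweet = "omg loool cant wait 4 2nite !!! :)"

print(kl_divergence(char_trigrams(news), char_trigrams(more_news)))
print(kl_divergence(char_trigrams(news), char_trigrams(tweet)))
```

On these toy strings the news/news divergence comes out lower than the news/tweet one. Note that KL divergence is asymmetric, so the order of the two corpora matters.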
Best,
Djamé
On 7 Oct 2013, at 16:27, Juan Fernández Fernández wrote:
>
> Hello,
>
> I am also interested in the topic, but in a simpler way. I would like to measure word-based (not sense-based) similarity, that is, to find sentences that share the same words (lemmas) once stopwords are excluded. As I need to preprocess Twitter sentiment corpora, I was wondering whether there are tools to detect this kind of word overlap, as found in spam or repetitive Twitter messages. Does anybody know of anything for Spanish?
>
> Thank you very much,
>
> Juan F.
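
One simple way to approach the word-overlap question above is Jaccard similarity over sets of content words. The following is only a sketch under stated assumptions: lowercased tokens stand in for lemmas (no lemmatizer is used), the Spanish stopword list is a tiny illustrative sample, and the function names, threshold and example tweets are hypothetical.

```python
import re
from itertools import combinations

# Tiny illustrative Spanish stopword list (an assumption of this sketch;
# a real application would use a fuller list).
STOPWORDS = {"a", "de", "el", "en", "la", "las", "los", "que",
             "un", "una", "y", "es", "muy"}

def content_words(text):
    """Lowercase, tokenize on word characters, and drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}

def jaccard(a, b):
    """Shared content words divided by all content words of both texts."""
    words_a, words_b = content_words(a), content_words(b)
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def near_duplicates(messages, threshold=0.8):
    """Return index pairs whose Jaccard similarity reaches the threshold."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(messages), 2)
            if jaccard(a, b) >= threshold]

# Hypothetical example messages.
tweets = [
    "Gana un iPhone gratis en este enlace",
    "Gana un iPhone GRATIS en este enlace!!!",
    "La película de ayer es muy buena",
]

print(near_duplicates(tweets))
```

Comparing all pairs is quadratic in the number of messages, which is fine for a small corpus; larger collections would need an indexing or hashing scheme, which this sketch does not attempt.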
>
>
>
> On 07/10/2013, at 15:44, Eneko Agirre wrote:
>
>>
>>
>> Hi Ted and all,
>>
>> you might want to check http://ixa2.si.ehu.es/ukb/, a graph-based algorithm for WSD and similarity, which uses random walks. It scores very high on RG65 and WordSim353 when run on WordNet, and can be applied to any KB.
>>
>> It's open source and includes all data necessary to replicate the results reported in the following:
>>
>> [3] Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca and Aitor Soroa. 2009. A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. Proceedings of NAACL-HLT 09. Boulder, USA.
>>
>> [4] Eneko Agirre, Montse Cuadros, German Rigau and Aitor Soroa. 2010. Exploring Knowledge Bases for Similarity. Proceedings of LREC 2010. Valletta, Malta.
>>
>> best
>>
>> eneko
>>
>>
>>
>> On 10/06/2013 05:45 PM, Ted Pedersen wrote:
>>> Greetings all,
>>>
>>> I'm preparing a tutorial on measuring semantic similarity and
>>> relatedness between concepts. My particular focus is on methods that
>>> do this using ontologies or other (at least somewhat) structured
>>> resources (like Wikipedia, folksonomies, etc.) and that also have
>>> freely available software associated with them (or at least a web
>>> demo).
>>>
>>> While it's a very interesting area, this particular tutorial won't
>>> include purely distributional approaches (due to time constraints), so
>>> I'm looking for methods and software that use some sort of resource
>>> like WordNet, Wikipedia, medical ontologies, Freebase, etc. to arrive
>>> at measurements of semantic similarity or relatedness between pairs of
>>> concepts.
>>>
>>> What follows is my current list, based on projects I have not only
>>> heard of but also used in the not too distant past - so I guess I'm
>>> particularly interested in projects you have used or created yourself
>>> (and can therefore vouch for to some extent).
>>>
>>> Based on WordNet, provide path, depth, info content based measures,
>>> may include relatedness measures like lesk, vector, hso
>>>
>>> WordNet::Similarity
>>> http://wn-similarity.sourceforge.net
>>>
>>> NLTK
>>> http://nltk.org
>>>
>>> ws4j
>>> https://code.google.com/p/ws4j/
>>>
>>> Based on UMLS (Unified Medical Language System), provide path, depth,
>>> info content measures, includes relatedness measures lesk, vector
>>>
>>> UMLS::Similarity
>>> http://umls-similarity.sourceforge.net
>>>
>>> Based on the Gene Ontology (GO), provide path, depth, and info content measures
>>>
>>> Proteinon
>>> http://lasige.di.fc.ul.pt/webtools/proteinon/
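
To make the "path" family of measures in the list above concrete, here is a minimal sketch of path similarity, computed as 1 / (1 + number of is-a edges on the shortest path between two concepts). The taxonomy is a hand-built toy tree, not WordNet, UMLS or GO, and the names `PARENT`, `ancestors` and `path_similarity` are hypothetical; none of the tools listed is used here.

```python
# Toy is-a taxonomy: child -> parent (a hypothetical example, not real WordNet data).
PARENT = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
}

def ancestors(concept):
    """Return the chain from a concept up to the root, concept included."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def path_similarity(a, b):
    """1 / (1 + edges on the shortest is-a path); 0.0 if no common ancestor."""
    depth_from_a = {node: d for d, node in enumerate(ancestors(a))}
    # In a tree, the first shared node on b's chain is the lowest common
    # ancestor, so the path through it is the shortest one.
    for depth_from_b, node in enumerate(ancestors(b)):
        if node in depth_from_a:
            return 1.0 / (1 + depth_from_a[node] + depth_from_b)
    return 0.0

print(path_similarity("dog", "wolf"))  # two edges apart, via "canine"
print(path_similarity("dog", "cat"))   # four edges apart, via "carnivore"
```

Depth-based and information-content measures build on the same structure but additionally weight the common ancestor by its depth or its corpus frequency, which this sketch leaves out.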
>>>
>>> I will post a summary of whatever I hear about after some period of
>>> time. Any hints or suggestions will be very gratefully received.
>>>
>>> Many thanks,
>>> Ted
>>>
>>
>>
>> --
>>
>> Eneko Agirre
>> Euskal Herriko Unibertsitatea
>> University of the Basque Country
>> http://ixa2.si.ehu.es/eneko
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora