[Corpora-List] semantic similarity

Dominic Widdows widdows at maya.com
Thu Jan 20 19:04:22 UTC 2005


Dear Jana,

Some of the infomap project's tools and methods may also help you -
links to demos, software and many papers are available from
http://infomap.stanford.edu

The main piece of software available performs latent semantic analysis
(there's a demo at infomap.stanford.edu/webdemo). While the current
demo requires that you input an initial set of query terms, the
software does build a dictionary file and it would be very easy to
iterate through this and output pairs of terms whose latent semantic
similarity is above a given threshold. (We have done this in the past
for harvesting translation pairs from parallel corpora.) We have also
found LSA to be a very useful filter for relationships extracted by
other means (for example, if you have two strings with similar
orthography you can check using LSA to see if they are also
contextually similar).
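
As a rough sketch of that iteration (this is not the infomap code itself:
the `vectors` dictionary, its toy values and the 0.8 threshold are all
assumptions for illustration, and in practice the vectors would come from
whatever reduced term space the LSA step produced):

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_pairs(vectors, threshold):
    """Return (term1, term2, score) for every pair scoring at or above threshold.

    `vectors` maps each dictionary term to its (hypothetical) LSA vector.
    No query term is needed: every unordered pair is compared once.
    """
    pairs = []
    for (t1, v1), (t2, v2) in combinations(sorted(vectors.items()), 2):
        score = cosine(v1, v2)
        if score >= threshold:
            pairs.append((t1, t2, score))
    # Highest-scoring pairs first.
    return sorted(pairs, key=lambda p: -p[2])


# Toy three-dimensional vectors, made up for this example.
vectors = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

for t1, t2, score in similar_pairs(vectors, threshold=0.8):
    print(f"{t1}\t{t2}\t{score:.3f}")
```

The all-pairs loop is quadratic in the size of the dictionary, so for a
large vocabulary you would probably want to restrict it to the more
frequent terms. The same `cosine` function could serve as the filter
mentioned above: compute it only for the candidate pairs proposed by the
string-similarity step.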

If any of the above sounds useful to you, let me know and I may be able
to help with more details.
I too am in Pittsburgh - must be a large part of a small world :)
Best wishes,
Dominic

> Hi Jana,
>
> have you looked at Latent Dirichlet Allocation, developed by Blei,
> Jordan and Ng? Take a look at Blei's homepage:
> http://www.cs.berkeley.edu/~blei/
>
> in particular,
> Latent Dirichlet allocation. D. Blei, A. Ng, and M. Jordan. Journal of
> Machine Learning Research, 3:993-1022, January 2003.
>
> Dave Blei is now a postdoc at CMU, and I'm a grad student here -- so
> feel free to stop by.
>
> Best,
> -Leo
>
> On Thu, 20 Jan 2005, Jana Diesner wrote:
>
>> Dear list members,
>>
>> We are looking for strategies, algorithms or code to automatically
>> find single terms or multiple adjacent terms that are semantically
>> similar within and across documents. The approach must not require
>> POS tagging or an initial input of a reference term to the system.
>> The resulting clusters of semantically similar terms suggested by
>> the system do not need to be exclusive. We are familiar with
>> secondstring, the software developed by William Cohen, and semantic
>> similarity based on string-edit distances.
>>
>>
>>
>> Thank you very much.
>>
>> Jana
>>
>>
>>
>> ____________________
>>
>> Jana Diesner
>> Carnegie Mellon University
>>
>> jdiesner at andrew.cmu.edu
>
>