[Corpora-List] SenseClusters v0.95 released (now supports LSA)

ted pedersen tpederse at d.umn.edu
Sat Aug 26 18:08:04 UTC 2006


We are pleased to announce the release of SenseClusters version 0.95.   

SenseClusters is a freely available package that allows you to cluster    
similar contexts, or to identify clusters of related words. It is fully   
unsupervised, and can automatically discover the optimal number of  
clusters in your text. 

As of version 0.95, we fully support Latent Semantic Analysis (LSA) for
context and word clustering, and we continue to improve the native
SenseClusters methods, which include the ability to cluster first and
second order representations of context.

SenseClusters can be downloaded from:

	http://senseclusters.sourceforge.net/

You can also try out SenseClusters via our web interface:

	http://marimba.d.umn.edu/cgi-bin/SC-cgi/index.cgi

In both native and LSA modes, SenseClusters relies on lexical features
(such as unigrams, bigrams, and co-occurrences) that can be identified
in raw text. Tokenization is very flexible and is defined via Perl
regular expressions, so it is possible to work with many languages
besides English. You can also easily work with tokenization schemes
other than white-space separated words, such as character-based tokens
(for example, 2-letter sequences).
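
As a rough illustration of regex-driven tokenization, here is a minimal
sketch in Python rather than Perl. The pattern names and the tokenize()
helper are hypothetical and are not part of SenseClusters; the sketch only
shows how swapping the regular expression changes what counts as a token.

```python
import re

# Hypothetical token patterns, for illustration only (not SenseClusters files).
WORD_TOKEN = re.compile(r"\w+")    # runs of word characters
CHAR_PAIR = re.compile(r"\w\w")    # non-overlapping 2-letter sequences

def tokenize(text, pattern):
    """Return every non-overlapping match of pattern in text, left to right."""
    return pattern.findall(text)

words = tokenize("the bank of the river", WORD_TOKEN)
pairs = tokenize("bank", CHAR_PAIR)
print(words)
print(pairs)
```

Whether character tokens should overlap is a design choice; this sketch
uses non-overlapping matches for simplicity.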

The native SenseClusters methods support traditional first order context    
clustering, where you identify a feature set, and then determine which of  
those features occur in the contexts you are clustering. The native   
methods also support second order context clustering, where each word 
is represented by a vector of the words with which it co-occurs. 
All the words in a context to be clustered are replaced by their 
associated vectors, and these vectors are averaged together to represent 
that context. Note that you can also cluster the word vectors to identify 
sets of related words. 
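
The two native representations can be sketched as follows. This is a toy
Python illustration, not the SenseClusters implementation: the function
names, the binary first order values, and the window-of-one co-occurrence
counts are all assumptions made for the example.

```python
from collections import defaultdict

def first_order(context, features):
    """Binary vector marking which of the chosen features occur in the context."""
    present = set(context)
    return [1 if f in present else 0 for f in features]

def cooccurrence_vectors(corpus, vocab, window=1):
    """Map each word to counts of the vocab words seen within `window` positions."""
    index = {w: i for i, w in enumerate(vocab)}
    vectors = defaultdict(lambda: [0] * len(vocab))
    for sent in corpus:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i and sent[j] in index:
                    vectors[w][index[sent[j]]] += 1
    return vectors

def second_order(context, vectors, dim):
    """Average the co-occurrence vectors of the words in the context."""
    rows = [vectors[w] for w in context if w in vectors]
    if not rows:
        return [0.0] * dim
    return [sum(col) / len(rows) for col in zip(*rows)]

# Toy data, invented for this example.
corpus = [["river", "bank", "water"], ["bank", "money", "loan"]]
vocab = ["bank", "money", "river", "water", "loan"]
vectors = cooccurrence_vectors(corpus, vocab)
print(first_order(["river", "bank"], ["bank", "money"]))
print(second_order(["money", "loan"], vectors, len(vocab)))
```

The rows of `vectors` are the word vectors themselves, so clustering them
directly corresponds to identifying sets of related words.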

Latent Semantic Analysis differs from the native SenseClusters methods in  
that each feature is represented by a vector that shows the contexts in  
which that feature occurs. Then, all the features in a context to be   
clustered are replaced by their associated vectors, and these are  
averaged together to represent the context. Note that you can also  
cluster the feature vectors directly to identify sets of related features. 
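
The LSA-style representation can be sketched the same way. Again this is
a toy Python illustration with hypothetical function names and invented
data. It also leaves out the dimensionality reduction step (typically a
singular value decomposition) that LSA normally applies to the
feature-by-context matrix, and shows only how each feature is represented
by the contexts in which it occurs.

```python
def feature_by_context(contexts, features):
    """Map each feature to a vector of its counts in each context."""
    return {f: [ctx.count(f) for ctx in contexts] for f in features}

def lsa_context_vector(context, feature_vectors, dim):
    """Average the feature-by-context vectors of the features in a context."""
    rows = [feature_vectors[t] for t in context if t in feature_vectors]
    if not rows:
        return [0.0] * dim
    return [sum(col) / len(rows) for col in zip(*rows)]

# Toy data, invented for this example.
contexts = [["river", "bank"], ["bank", "loan"], ["loan", "money"]]
features = ["bank", "loan", "river"]
feature_vectors = feature_by_context(contexts, features)
print(feature_vectors)
print(lsa_context_vector(["bank", "loan"], feature_vectors, len(contexts)))
```

Note that the vector dimensions here are contexts, whereas in the second
order native sketch they are co-occurring words.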

This release represents a major step forward in the functionality of
SenseClusters. Much of the work in providing LSA support was carried out
by Mahesh Joshi this spring and summer. As always over the last two
years, Anagha Kulkarni played a large role in this release, and has
provided a wide range of improvements in automatic cluster stopping and
other areas.

Please give this a try, and let us know if you have any comments or
questions! If you aren't certain whether your problem can be approached
using SenseClusters, please let us know what you would like to do and
maybe we can help you get started.

Cordially,
Ted, Anagha, and Mahesh

====================================================================

ChangeLog:
http://www.d.umn.edu/~tpederse/Code/Changelog.SenseClusters-v0.95.txt

Installation Instructions: 
http://www.d.umn.edu/~tpederse/Code/SenseClusters-v0.95-INSTALL.txt

Related Publications (includes links to data you can use):
http://www.d.umn.edu/~tpederse/senseclusters-pubs.html

--
Ted Pedersen
http://www.d.umn.edu/~tpederse


