<html>
<body>
At 2004-05-11 03:24, you wrote:<br>
<blockquote type=cite class=cite cite>Dear all,<br><br>
Does anyone know of a tool (or algorithm), preferably available freely
for research purposes, that takes as its input a corpus only and
produces as its output clusters of tokens that occur close to each other
relatively often?</blockquote><br>
I created such a tool, but it is a commercial product. You have already
received suggestions for links to free clustering routines. If you
can't find something that suits you among public domain software, take a
look at our software
(<a href="http://www.simstat.com/wordstat.htm" eudora="autourl">www.simstat.com/wordstat.htm</a>).<br><br>
However, I hope you won't mind me asking a few questions about the
usefulness of clustering a corpus. There are hundreds of ways of
clustering the same text corpus. To give you just a few examples of
variations:
<ul>
<li>You can define proximity as two words occurring in the same sentence,
the same paragraph, the same text, or a small window of words. Each
definition will give you a different picture and different information.
<li>You can use many kinds of similarity indices (some based on mere
occurrence, like the Jaccard coefficient, often used in the clustering of
text, or based on frequencies, like the cosine coefficient). I personally
like to use an "inclusion index" that was developed in library science
but that I haven't seen applied anywhere else (though I didn't look very
hard).
<li>You can use one of the many hierarchical clustering algorithms, or a
partitioning method such as K-means or J-means.
<li>You may also apply various feature selection methods.
</ul>I wonder whether someone has written a paper (or a book) on how
those various ways of clustering textual data differ, and on how each
of them may capture a different reality. I have seen some articles
comparing a few of those options for the analysis of textual data, and I
am also familiar with books devoted to the clustering of numerical data
that discuss all those aspects in that context, but what I am looking for
is a more comprehensive discussion of all those aspects of clustering
when applied to the analysis of textual data. Any suggested
reading?<br><br>
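As a rough illustration of how the choice of similarity index alone can
change the picture, here is a minimal Python sketch that scores one word
pair with three indices. The function names and the toy segments are
hypothetical, chosen only for this example. The inclusion index is assumed
here to be the number of shared segments divided by the number of segments
containing the rarer word, which is one common definition from co-word
analysis; the exact formula meant above is not stated, so it may
differ.<br>
<pre>
```python
import math

def segment_counts(segments, word):
    """Return how often `word` occurs in each segment (a list of tokens)."""
    return [seg.count(word) for seg in segments]

def jaccard(a, b):
    """Occurrence-based: shared segments / segments containing either word."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

def cosine(a, b):
    """Frequency-based: cosine of the angle between the two count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def inclusion(a, b):
    """Assumed definition: shared segments / segments with the rarer word."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    rarer = min(sum(1 for x in a if x), sum(1 for y in b if y))
    return both / rarer if rarer else 0.0

# Toy corpus: each segment could be a sentence, a paragraph, or a window.
segments = [
    "the cat sat on the mat".split(),
    "the cat chased the dog".split(),
    "the dog barked".split(),
    "a cat and a cat".split(),
]

cat = segment_counts(segments, "cat")
dog = segment_counts(segments, "dog")

print("Jaccard:  ", jaccard(cat, dog))
print("Cosine:   ", cosine(cat, dog))
print("Inclusion:", inclusion(cat, dog))
```
</pre>
Changing what counts as a segment (sentence, paragraph, whole text, or a
sliding window) changes the count vectors, so the same three functions
would also show the effect of the proximity definition.<br><br>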
Normand Peladeau<br>
Provalis Research<br>
<a href="http://www.simstat.com/" eudora="autourl">www.simstat.com</a><br><br>
</body>
</html>