<html>

<body>

At 2004-05-11 03:24, you wrote:<br>

<blockquote type=cite class=cite cite>Dear all,<br><br>

Does anyone know of a tool (or algorithm), preferably available

freely<br>

for research purposes, that takes as its input a corpus only and<br>

produces as its output clusters of tokens that occur close to each

other<br>

relatively often?</blockquote><br>

I wrote software that does this, but it is a commercial product. You have

already received links to free clustering routines.  If none of the

public domain software suits you, take a

look at ours

(<a href="http://www.simstat.com/wordstat.htm" eudora="autourl">www.simstat.com/wordstat.htm</a>)<br><br>

However, I hope you won't mind my asking a question about the

usefulness of clustering a corpus.  There are hundreds of ways of

clustering the same text corpus. To give just a few examples of the

variations:

<ul>

<li>You can define proximity as two words co-occurring in the same sentence,

paragraph, or entire text, or within a small window of words.  Each

definition will give you a different picture and different information.

<li>You can use many kinds of similarity indices (some based on mere

occurrence, like the Jaccard coefficient often used in text clustering,

others based on frequencies, like the cosine coefficient). I personally like to

use an "inclusion index" that was developed in library science

but that I have not seen applied anywhere else (though I didn't look very

hard).

<li>You can use one of the many hierarchical clustering algorithms, or

something like K-means or J-means clustering.

<li>You may also apply various feature selection methods.

</ul>I wonder whether someone has written a paper (or a book) on how

these various ways of clustering textual data differ, and how each

may capture a different reality. I have seen some articles comparing a

few of these options for the analysis of textual data, and I am also

familiar with books on the clustering of numerical data that

discuss all of these aspects in that context. What I am looking for

is a more comprehensive discussion of all of them as they apply

to the analysis of textual data.  Any suggested

reading?<br><br>
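To make the point about similarity indices concrete, here is a minimal Python sketch of how the same pair of words can score differently under the indices mentioned in the list above. The toy corpus and all function names are made up for illustration. Proximity is taken to be co-occurrence in the same document. The inclusion index is computed here as the number of shared documents divided by the smaller of the two document counts, which is an assumption about its definition, since no formula is given above.<br><br>

<pre>
```python
import math
from collections import Counter

# Toy corpus, invented for illustration; each string is one "document".
docs = [
    "cluster text corpus",
    "cluster text words",
    "corpus words",
    "cluster cluster text",
]

# One word-frequency table per document.
counts = [Counter(d.split()) for d in docs]


def doc_set(word):
    """Indices of the documents that contain the word at least once."""
    return {i for i, c in enumerate(counts) if word in c}


def jaccard(a, b):
    """Occurrence-based: shared documents over documents with either word."""
    sa, sb = doc_set(a), doc_set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def cosine(a, b):
    """Frequency-based: cosine of the two per-document frequency vectors."""
    va = [c[a] for c in counts]
    vb = [c[b] for c in counts]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0


def inclusion(a, b):
    """Assumed definition: shared documents over the smaller document count."""
    sa, sb = doc_set(a), doc_set(b)
    smaller = min(len(sa), len(sb))
    return len(sa & sb) / smaller if smaller else 0.0


for pair in [("cluster", "text"), ("cluster", "corpus")]:
    print(pair, jaccard(*pair), round(cosine(*pair), 3), inclusion(*pair))
```
</pre>

On this toy corpus, "cluster" and "text" always appear in the same documents, so their Jaccard value is 1.0, while their cosine is about 0.943 because the frequencies differ in one document. For "cluster" and "corpus", the Jaccard value is 0.25 but the inclusion index, as defined here, is 0.5.<br><br>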

Normand Peladeau<br>

Provalis Research<br>

<a href="http://www.simstat.com/" eudora="autourl">www.simstat.com</a><br><br>

</body>

</html>