[Corpora-List] token clustering tool

Tue May 11 23:07:14 UTC 2004

At 2004-05-11 03:24, you wrote:
>Dear all,
>
>Does anyone know of a tool (or algorithm), preferably available freely
>for research purposes, that takes as its input a corpus only and
>produces as its output clusters of tokens that occur close to each other
>relatively often?

I have written such a program, but it is a commercial product. You have
already received suggestions for links to free clustering routines.  If you
can't find something that suits you in public domain software, take a look
at our software (www.simstat.com/wordstat.htm).

However, I hope you won't mind me asking a few questions about the
usefulness of clustering a corpus.  There are hundreds of ways of clustering
the same text corpus. To give you just a few examples of variations:
    * You can define proximity as two words occurring in the same sentence,
paragraph, or entire text, or within a small window of words.  Each choice
will give you a different picture and different information.
    * You can use many kinds of similarity indices (some based on mere
occurrence, like the Jaccard coefficient often used in the clustering of
text, others based on frequencies, like the cosine coefficient). I
personally like to use an "inclusion index" that was developed in library
science but that I have not seen applied anywhere else (though I didn't
look very hard).
    * You can use one of the many hierarchical clustering algorithms or use
something like a K-means or J-means clustering method.
    * You may also apply various feature selection methods.
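
To make the first two variations concrete, here is a minimal sketch in
Python. It is not how any particular package does it. The function names
and the three toy sentences are invented for illustration, and the
inclusion index is written under one common definition from co-word
analysis (co-occurrences divided by the smaller of the two word counts),
which is an assumption on my part.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def units_by_window(tokens, size):
    """Split a token list into non-overlapping windows of `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def cooccurrence(units):
    """Count the units each word occurs in, and the units each pair shares."""
    word_units, pair_units = Counter(), Counter()
    for unit in units:
        present = sorted(set(unit))
        word_units.update(present)
        pair_units.update(combinations(present, 2))
    return word_units, pair_units

def jaccard(a, b, word_units, pair_units):
    """Occurrence-based: shared units over units containing either word."""
    both = pair_units[tuple(sorted((a, b)))]
    either = word_units[a] + word_units[b] - both
    return both / either if either else 0.0

def inclusion(a, b, word_units, pair_units):
    """Assumed definition: shared units over the rarer word's unit count."""
    both = pair_units[tuple(sorted((a, b)))]
    rarer = min(word_units[a], word_units[b])
    return both / rarer if rarer else 0.0

def cosine(a, b, units):
    """Frequency-based: cosine of the two words' per-unit frequency vectors."""
    fa = [unit.count(a) for unit in units]
    fb = [unit.count(b) for unit in units]
    dot = sum(x * y for x, y in zip(fa, fb))
    norm = sqrt(sum(x * x for x in fa)) * sqrt(sum(y * y for y in fb))
    return dot / norm if norm else 0.0

# Toy corpus (hypothetical): here the unit of proximity is the sentence.
sentences = [s.split() for s in ("the cat sat", "the cat ran", "the dog ran")]
word_units, pair_units = cooccurrence(sentences)
print(jaccard("cat", "the", word_units, pair_units))
print(inclusion("cat", "the", word_units, pair_units))
print(cosine("cat", "the", sentences))

# Same tokens, but proximity redefined as a 2-word window.
flat = [tok for sent in sentences for tok in sent]
print(units_by_window(flat, 2))
```

In this toy corpus "cat" never occurs without "the", so the inclusion index
for that pair is 1.0 while the Jaccard coefficient is only 2/3. Switching
from sentences to windows changes the counts again, which is the point of
the first bullet.
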
I wonder whether anyone has written a paper (or a book) on how those
various ways of clustering textual data differ, and on what different
aspects of the text each one brings out. I have seen some articles comparing
a few of those options for the analysis of textual data, and I am also
familiar with books devoted to the clustering of numerical data that discuss
all those aspects in that context, but what I am looking for is a more
comprehensive discussion of all those aspects of clustering when applied to
the analysis of textual data.  Any suggested reading?
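
For the clustering step itself, the kind of comparison I have in mind could
start from something as small as the sketch below: a plain average-linkage
agglomerative routine that accepts any similarity function, so that
different indices can be swapped in and the resulting clusters compared.
The function name, the stopping rule (a fixed number of clusters `k`) and
the similarity values are all made up for illustration.

```python
def average_linkage(items, sim, k):
    """Merge the two most similar clusters until only k clusters remain.

    `sim(a, b)` is any symmetric similarity between two items; similarity
    between clusters is the mean over all cross-cluster item pairs.
    """
    clusters = [[item] for item in items]
    while len(clusters) > k:
        best = None  # (score, i, j) of the best pair of clusters so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                score = sum(sim(a, b) for a, b in pairs) / len(pairs)
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Invented similarity values, standing in for any of the indices above.
toy_sims = {
    ("cat", "dog"): 0.9, ("ran", "sat"): 0.8,
    ("cat", "ran"): 0.1, ("cat", "sat"): 0.1,
    ("dog", "ran"): 0.1, ("dog", "sat"): 0.2,
}

def toy_sim(a, b):
    """Look up a pair regardless of the order it is given in."""
    return toy_sims[tuple(sorted((a, b)))]

print(average_linkage(["cat", "dog", "ran", "sat"], toy_sim, 2))
```

Running the same routine with two different similarity functions over the
same word list is a quick way to see how much the choice of index alone
changes the clusters.
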

Normand Peladeau
Provalis Research
www.simstat.com
