[Corpora-List] How to word presentation for word clustering?

chen wenliang chenwl at mail.neu.edu.cn
Thu Jul 8 01:45:40 UTC 2004


Thanks for your reply!

Did you try the method "TF*IDF values as word representation" for word clustering?

Here is how I define when two words should be in the same cluster.

For example, football and basketball should be in the same cluster because they always appear in the "sports" category.

So I prefer to put two words in the same cluster when they always appear in the same categories.

But I don't have a large labeled document corpus (with category labels), so I cannot use the class distribution of words for clustering (as Baker 98 describes).

I want to cluster words using only a large unlabeled document corpus.
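
To make the setup concrete, here is a minimal sketch of clustering words without any category labels: each word is represented by its frequency in every document, w = {tf(1), ..., tf(n)}, and the vectors are grouped with k-means. Words that occur in the same documents get similar vectors, which is only a rough stand-in for "appear in the same categories". The toy documents, the function names, the length normalization and the choice of seed words are all assumptions made for illustration, not part of any existing tool.

```python
import math
from collections import Counter

def tf_vectors(docs):
    """Map each word to [tf(1), ..., tf(n)], its count in each document."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    return {w: [float(c[w]) for c in counts] for w in vocab}

def normalize(vec):
    """Scale to unit length so frequent and rare words are comparable (assumption)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, seeds, iterations=20):
    """Plain k-means; the vectors of the `seeds` words are the initial centroids."""
    centroids = [list(vectors[s]) for s in seeds]
    assignment = {}
    for _ in range(iterations):
        # Assign every word to its nearest centroid.
        new_assignment = {
            w: min(range(len(centroids)), key=lambda j: sq_dist(v, centroids[j]))
            for w, v in vectors.items()
        }
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [vectors[w] for w, c in assignment.items() if c == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Hypothetical unlabeled corpus: no category labels are used anywhere.
docs = [
    "football match football goal",
    "basketball match basketball score",
    "football basketball match goal score",
    "stock market price",
    "market price bank stock",
]
vectors = {w: normalize(v) for w, v in tf_vectors(docs).items()}
clusters = kmeans(vectors, seeds=["football", "stock"])

for cluster_id in sorted(set(clusters.values())):
    print(cluster_id, sorted(w for w, c in clusters.items() if c == cluster_id))
```

On this toy input, football and basketball end up in the same cluster and the finance words in the other. On a real corpus the vectors are very sparse and high-dimensional, so the weighting (TF, TF*IDF or binary) and the distance measure would need to be compared empirically.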
  
Regards,

Chen Wenliang  chenwl at mail.neu.edu.cn 2004-07-08
======= 2004-07-07 Original Message=======

>Dear Chen Wenliang,
>
>I am using TF*IDF values as my representation for words.
>vector w = {tf(1)*IDF(1), tf(2)*IDF(2), ..., tf(n)*IDF(n)}, where the IDF is
>computed from a large corpus. This seems to give better results than just
>the raw frequency counts.
>The representations I investigated were TF, TF*IDF, and simple binary values
>(1 if the word occurs and 0 if it does not).
>
>Regards,
>
>Clive De Silva
>University of Cambridge
>----- Original Message -----
>From: "chen wenliang" <chenwl at mail.neu.edu.cn>
>To: <corpora at hd.uib.no>
>Sent: Wednesday, July 07, 2004 10:17 AM
>Subject: [Corpora-List] How to word presentation for word clustering?
>
>
>Dear all,
>
>I am looking for a word representation for word clustering.
>
>I am working on a project about word clustering. Currently each word is
>represented as a vector w = {tf(1), tf(2), ..., tf(n)}, where tf(i) is the
>frequency of the word in document i. I then use k-means as the clustering
>algorithm.
>
>Thanks all.
>  
>
>regards,
>
>Chen Wenliang chenwl at mail.neu.edu.cn
>
>Nlplab, Northeastern University, China.
>
>2004-07-07
