[Corpora-List] Text clustering evaluation metrics

Dan danl4 at cs.byu.edu
Fri Sep 17 16:17:38 UTC 2010


Sergio,

There are quite a few of them out there, especially if your data are
labeled.  In the presence of labels it is possible to use an external
or extrinsic evaluation method that compares the partition produced by
your algorithm to the partition specified by the "true" labels on the
data.

I'll give a list of BibTeX references at the end of the e-mail, but
briefly, a few examples are V-Measure (Rosenberg and Hirschberg,
2007), Variation of Information (Meila, 2007), Q_0 and Q_2 (Dom,
2001), F-measure (Steinbach et al., 2000), Average Entropy (Liu et
al., 2003), and the Adjusted Rand Index (Hubert and Arabie, 1985).
There might be better citations for F-measure and Average Entropy, but
the methods are described in these papers.
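
As a quick illustration, here is how two of these can be computed with
scikit-learn (assuming you have it installed; the toy labels below are
made up).  Both scores are invariant to how the cluster IDs happen to
be numbered:

    # Compare a predicted partition against gold-standard labels.
    from sklearn.metrics import v_measure_score, adjusted_rand_score

    true_labels = [0, 0, 0, 1, 1, 2]   # the "natural" classes
    cluster_ids = [1, 1, 0, 0, 2, 2]   # the partition from your algorithm

    print(v_measure_score(true_labels, cluster_ids))
    print(adjusted_rand_score(true_labels, cluster_ids))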

There's also a paper that discusses some of these, and others that I
don't have much experience with (such as BCubed), and gives a good
framework in which to compare them.  That is "A comparison of
extrinsic clustering evaluation metrics based on formal constraints",
by Amigo, Gonzalo, Artiles, and Verdejo, written in 2009.

--dan

BibTeX for most of the references given:
@inproceedings{rh-2007,
	author	=	{Andrew Rosenberg and Julia Hirschberg},
	title	=	{V-Measure: A conditional entropy-based external cluster
			evaluation measure},
	booktitle	=	{Joint Conference on Empirical Methods in Natural Language
			Processing and Computational Natural Language Learning},
	year	=	2007,
	month	=	Jun,
	address	=	{Prague},
	annote	=	{Introduces the V-Measure external clustering metric.  This
			metric is the harmonic mean of two measures, which the
			authors call Homogeneity and Completeness.  Homogeneity has
			to do with the "purity" of the individual clusters and is
			highest when all clusters consist of only documents from a
			single natural class.  Completeness is a measure of how well
			points from the same natural class were grouped together,
			and is highest when every member of every natural class is
			clustered together with every other member of its own
			natural class.  The authors also present a discussion of
			various types of metrics and their properties.  A rubric of
			desirable properties for a clustering metric is cited and
			several metrics are evaluated against it.  The only metrics
			that consistently satisfy the required properties are the
			V-Measure, Variation of Information, and a metric variously
			called $Q_0$ and $Q_2$.  Datasets used: TDT-4 (Strassel and
			Glenn, 2003)}
}
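
To make the definition concrete, here is a small self-contained sketch
of the V-Measure computed straight from its entropy definition (my own
illustration, not the authors' code; scikit-learn's v_measure_score
computes the same quantity):

    import numpy as np

    def v_measure(classes, clusters):
        # Class-by-cluster contingency table, then joint distribution.
        _, ci = np.unique(classes, return_inverse=True)
        _, ki = np.unique(clusters, return_inverse=True)
        cont = np.zeros((ci.max() + 1, ki.max() + 1))
        for c, k in zip(ci, ki):
            cont[c, k] += 1
        p = cont / cont.sum()

        def H(d):                      # entropy of a distribution, in nats
            d = d[d > 0]
            return -np.sum(d * np.log(d))

        h_c, h_k, h_ck = H(p.sum(axis=1)), H(p.sum(axis=0)), H(p.ravel())
        hom = 1 - (h_ck - h_k) / h_c if h_c > 0 else 1.0  # 1 - H(C|K)/H(C)
        com = 1 - (h_ck - h_c) / h_k if h_k > 0 else 1.0  # 1 - H(K|C)/H(K)
        # Harmonic mean of homogeneity and completeness.
        return 2 * hom * com / (hom + com) if hom + com > 0 else 0.0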

@article{m-2007,
	author	=	{Marina Meil\u{a}},
	title	=	{Comparing Clusterings---an information based distance},
	journal	=	{Journal of Multivariate Analysis},
	volume	=	98,
	number	=	5,
	pages	=	{873--895},
	year	=	2007,
	annote	=	{A journal version of a tech report published in 2002.
			Introduces the Variation of Information, which the author
			proves is a true metric (satisfies the triangle inequality,
			etc.) for comparing partitionings of a dataset.},
	publisher	=	{Academic Press, Inc.},
	address	=	{Orlando, FL, USA}
}
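
The Variation of Information itself has a very short closed form,
VI(C, K) = H(C) + H(K) - 2 I(C; K).  A sketch (mine, not Meila's code;
scikit-learn's mutual_info_score returns mutual information in nats,
so the units match):

    import numpy as np
    from sklearn.metrics import mutual_info_score

    def variation_of_information(labels_a, labels_b):
        # H(A) + H(B) - 2 * I(A; B).  0 means identical partitions;
        # lower is better, unlike most of the other metrics here.
        def H(labels):
            _, counts = np.unique(labels, return_counts=True)
            p = counts / counts.sum()
            return -np.sum(p * np.log(p))
        return (H(labels_a) + H(labels_b)
                - 2 * mutual_info_score(labels_a, labels_b))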

@techreport{dom-2001,
	author	=	{Byron Dom},
	title	=	{An Information-Theoretic External Cluster-Validity Measure},
	number	=	{RJ10219},
	year	=	2001,
	month	=	Oct,
	institution	=	{IBM},
	url	=	{citeseer.ist.psu.edu/dom01informationtheoretic.html},
	annote	=	{Presents the $Q_0$ and $Q_2$ metrics, the latter being a
			normalized version of the former.  The metric contains a
			penalty for complexity: if two clusterings with different
			values of $K$ are otherwise of the same quality, the one
			with the lower value of $K$ will get a better score.}
}

@techreport{skk-2000,
	author	=	{Michael Steinbach and George Karypis and Vipin Kumar},
	title	=	{A Comparison of Document Clustering Techniques},
	institution	=	{Department of Computer Science and Engineering,
			University of Minnesota},
	number	=	{00-034},
	year	=	2000,
	month	=	May,
	url	=	{http://www.cs.umn.edu/tech_reports_upload/tr2000/00-034.pdf},
	annote	=	{Compares K-means and a variant called bisecting K-means to
			agglomerative clustering.  Separates clustering metrics into
			two categories, internal and external, which are computed
			without and with gold-standard classifications,
			respectively.  External metrics used: entropy and F-measure.
			Internal metric used: overall similarity.  Datasets: TREC-5
			(Foreign Broadcast Information Service and LA Times),
			TREC-6, TREC-7, Reuters-21578.  Metrics: F-measure (mostly).}
}
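
Their entropy metric is just the entropy of each cluster's class
distribution, weighted by cluster size.  A sketch (mine, not theirs; I
use log base 2, though the base is only a scaling choice):

    import numpy as np

    def clustering_entropy(classes, clusters):
        # Weighted average over clusters of the entropy of each
        # cluster's class distribution.  0 means every cluster is pure.
        classes, clusters = np.asarray(classes), np.asarray(clusters)
        n = len(classes)
        total = 0.0
        for k in np.unique(clusters):
            members = classes[clusters == k]
            _, counts = np.unique(members, return_counts=True)
            p = counts / counts.sum()
            total += (len(members) / n) * -np.sum(p * np.log2(p))
        return total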

@article{ha-1985,
	author	=	{Lawrence Hubert and Phipps Arabie},
	title	=	{Comparing Partitions},
	journal	=	{Journal of Classification},
	year	=	1985,
	volume	=	2,
	number	=	1,
	pages	=	{193--218},
	month	=	Dec
}
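
The Adjusted Rand Index corrects the pair-counting Rand index for
chance agreement, so that random partitions score near 0.  A sketch of
the contingency-table form (my illustration; scikit-learn's
adjusted_rand_score is the production version):

    import numpy as np
    from math import comb

    def adjusted_rand_index(labels_a, labels_b):
        # Contingency table between the two partitions.
        _, ai = np.unique(labels_a, return_inverse=True)
        _, bi = np.unique(labels_b, return_inverse=True)
        cont = np.zeros((ai.max() + 1, bi.max() + 1), dtype=int)
        for i, j in zip(ai, bi):
            cont[i, j] += 1
        # Numbers of agreeing pairs within cells, rows, and columns.
        sum_cells = sum(comb(int(x), 2) for x in cont.ravel())
        sum_rows = sum(comb(int(x), 2) for x in cont.sum(axis=1))
        sum_cols = sum(comb(int(x), 2) for x in cont.sum(axis=0))
        expected = sum_rows * sum_cols / comb(int(cont.sum()), 2)
        max_index = (sum_rows + sum_cols) / 2
        return (sum_cells - expected) / (max_index - expected)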

@inproceedings{llcm-2003,
	author	=	{Tao Liu and Shengping Liu and Zheng Chen and Wei-Ying Ma},
	title	=	{An Evaluation on Feature Selection for Text Clustering},
	booktitle	=	{Proceedings of the Twentieth International Conference on
			Machine Learning (ICML 2003)},
	address	=	{Washington D.C.},
	year	=	2003,
	month	=	Aug,
	annote	=	{Gives an overview of several methods for unsupervised
			feature selection (i.e., for document clustering).
			Previously existing methods discussed include Document
			Frequency, Term Strength, Entropy-based Information Gain,
			and $\chi^2$.  Two new methods, ``Term Contribution'' and
			``Iterative Feature Selection'', are introduced.  The
			methods are compared on various datasets.  Their Term
			Contribution metric performed very well (winning in most of
			the graphs shown) and is less computationally intensive than
			other unsupervised ``single-shot'' algorithms.

			Their iterative feature selection technique starts by
			clustering the data.  Then a supervised feature selection
			algorithm is run over the data (they used Information Gain
			and $\chi^2$), treating the cluster assignments as labels.
			The data are then clustered again, the supervised feature
			selection algorithm is run again, and the process repeats
			until the desired number of features has been removed.  They
			observed significant performance improvements in their
			metrics using this method; that is, better clusters were
			produced with the feature selection than without it.
			Datasets used: Reuters-21578, 20 Newsgroups, and a web
			directory dataset collected from the Open Directory Project.
			Metrics used: Entropy (weighted sum of the entropy of all
			clusters) and Precision (associate a class label with each
			cluster, then calculate precision as normal, and take a
			weighted average over all clusters).}
}
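
For what it's worth, here is my reading of that iterative loop as a
sketch (not the authors' code; the cluster count and drop schedule are
placeholder parameters, and chi-squared needs a nonnegative
document-term matrix):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.feature_selection import chi2

    def iterative_feature_selection(X, n_clusters, target, drop_per_round=100):
        # X: nonnegative document-term matrix.  Returns the indices of
        # the features that survive the iterative selection.
        kept = np.arange(X.shape[1])
        while len(kept) > target:
            # Cluster on the current feature set ...
            km = KMeans(n_clusters=n_clusters, n_init=10)
            labels = km.fit_predict(X[:, kept])
            # ... then score features supervised against the cluster labels.
            scores, _ = chi2(X[:, kept], labels)
            scores = np.nan_to_num(scores)         # constant features score NaN
            n_drop = min(drop_per_round, len(kept) - target)
            weakest = np.argsort(scores)[:n_drop]  # lowest-scoring features
            kept = np.delete(kept, weakest)
        return kept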


--dan

On Fri, Sep 17, 2010 at 8:00 AM, Eric Ringger <ringger at cs.byu.edu> wrote:
> Good morning,
>
> I encourage you to respond to this message.
>
> --Eric
>
> -----Original Message-----
> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
> Sergio Castro
> Sent: Friday, September 17, 2010 7:10 AM
> To: CORPORA at UIB.NO
> Subject: [Corpora-List] Text clustering evaluation metrics
>
> Hi all,
>
> I'm starting my Ph.D. thesis and I need some information:
> what are the standard evaluation metrics for text clustering?
>
> Thank you for your help.
>

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


