[Corpora-List] Developing and testing new similarity measures for word clustering

Normand Peladeau peladeau at simstat.com
Tue Oct 12 16:00:34 UTC 2004


Many thanks to all those who answered my question about methods for
comparing similarity measures.  I am overwhelmed with new articles and new
perspectives, and will need several weeks (or months) in my busy schedule to
assimilate all this information.

Many of the papers suggested to me were quite relevant, but I was
especially impressed by Julie Weeds's thesis on similarity measures.  I
share her view that some similarity measures may be better suited to some
applications, while others may be more appropriate for other types of
applications.

One type of application that I didn't see mentioned was knowledge discovery,
and I believe that it may require very different similarity measures from
those used for automatic thesaurus construction, text retrieval, etc.

In a project I am working on right now, we are trying to identify abnormally
strong relationships between otherwise unrelated words (the project concerns
ergonomic problems and human errors in airplane flights).  We found the
following measure to be very sensitive for the discovery of unexpected
relationships:

	Inclusion index = a / min(a+b, a+c)

This inclusion index, which varies between 0 and 1, has been used in
library science to identify hierarchical relationships between words (here a
is the number of contexts containing both words, b the number containing
only word #1, and c the number containing only word #2).  One interesting
property is that it reaches its maximum value of 1 if word #1 is always
associated with word #2, even though word #2 may not always be associated
with word #1.  For example, suppose "baseball" appears 10 times, always in
association with "sport", while "sport" appears 100 times, only 1/10 of
those times in the presence of "baseball".  The inclusion index takes a
value of 1 because one word is considered to be included in the other (it
seems to measure a kind of hyponymy relationship).
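
To make this concrete, here is a minimal Python sketch (illustrative only,
not code from our project) of the index, using the contingency-table counts
described above:

    def inclusion_index(a, b, c):
        # a: number of contexts containing both words
        # b: number of contexts containing word #1 but not word #2
        # c: number of contexts containing word #2 but not word #1
        denominator = min(a + b, a + c)   # frequency of the rarer word
        if denominator == 0:
            return 0.0                    # neither word occurs at all
        return float(a) / denominator

    # "baseball" occurs 10 times, always with "sport";
    # "sport" occurs 100 times, 90 of them without "baseball".
    print(inclusion_index(a=10, b=0, c=90))   # prints 1.0

The min() in the denominator is what lets the index reach 1 whenever the
rarer word never occurs without the more frequent one.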

With this specific index we were able to identify ergonomic problems that
were real but would have remained undetected had we used other similarity
measures.  I wonder whether such a measure has been used for other types of
applications.

There seem to be a lot of empirical studies on those indices, but I have
not seen much theoretical evaluation (though I am not an expert in this
area). I am under the impression that many basic theoretical questions
remain unanswered when we choose a similarity measure.  Here are a few of
those questions:

	1) Should we consider a joint absence (both words are absent from a
context) as an indication of their similarity?
	2) Should we consider a negative correlation (one word occurs but not the
other) as an indication of their dissimilarity or lack of similarity? But
what about synonyms?
	3) Should we consider the probabilistic nature of co-occurrences?
	etc.

Many of the measures we use make different assumptions about those questions.

For example, from what I know, it seems that these indices make the
following assumptions about those three questions:

	Simple matching	1) Yes	2) Partially   3) No
	Jaccard & Dice		1) No	2) Partially   3) No
	Correlation		1) Yes	2) Yes	       3) Yes
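
To make the comparison concrete, here is a rough Python sketch of these
indices computed from the four cells of a 2x2 contingency table (a = both
words present, b = word #1 only, c = word #2 only, d = both absent); the
formulas are the standard textbook definitions, so treat this as an
illustration rather than a reference implementation (degenerate tables with
zero denominators are not handled):

    import math

    def simple_matching(a, b, c, d):
        # counts joint absences (d) as evidence of similarity
        return float(a + d) / (a + b + c + d)

    def jaccard(a, b, c, d):
        # ignores joint absences entirely
        return float(a) / (a + b + c)

    def dice(a, b, c, d):
        # ignores joint absences; gives double weight to co-occurrences
        return 2.0 * a / (2 * a + b + c)

    def phi_correlation(a, b, c, d):
        # Pearson correlation on a 2x2 table; uses all four cells and can
        # go negative when the words tend to avoid each other
        denominator = math.sqrt(float((a + b) * (c + d) * (a + c) * (b + d)))
        return (a * d - b * c) / denominator

The correlation here is the phi coefficient (Pearson correlation on binary
data), which is why it is the only one of the three that uses all four cells
and can express negative association directly.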

I have seen this kind of discussion in biology and ecology, but I don't
remember seeing a paper discussing these basic questions for the analysis
of textual data.  Does anyone know of a good discussion of these theoretical
issues?

Best regards,

Normand Peladeau
Provalis Research
www.simstat.com


