[Corpora-List] Developing and testing new similarity measures for word clustering

Eric Atwell eric at comp.leeds.ac.uk
Sun Oct 10 21:42:36 UTC 2004


Normand,
You could empirically evaluate the output of a word-clustering program
by comparing results with an established tagset - for example,
word-clusters learnt on an English corpus can be evaluated by seeing
whether words in a cluster share the same PoS-tag in an established
English corpus-based tagset such as that used in the tagged LOB corpus;
see:

Hughes J and Atwell E. 1994. The automated evaluation of inferred word
classifications, in Cohn A G, (editor), Proceedings of ECAI'94: 11th
European Conference on Artificial Intelligence, pages 535-540, John
Wiley, Chichester.
http://www.comp.leeds.ac.uk/nlp/papers/hughes+atwell94ecai.ps.Z

  - clustering of English word-types into grammatical classes, based on
similarity of contexts in a corpus. Several alternative metrics are
evaluated by comparing the clusters produced with the LOB Corpus tagset.
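
As a rough illustration of this kind of evaluation, here is a minimal
Python sketch that scores a clustering by how often the words in a cluster
share the majority PoS-tag of that cluster. It is not the scoring used in
the paper above; the function name cluster_purity, the one-tag-per-word
simplification, and the toy data are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def cluster_purity(clusters, gold_tags):
    """Fraction of words whose gold tag equals the majority tag of their cluster.

    clusters:  dict mapping word -> cluster id (output of any clustering program)
    gold_tags: dict mapping word -> a single PoS-tag (a simplification: real
               word-types are often ambiguous between several tags)
    """
    tags_by_cluster = defaultdict(list)
    for word, cluster_id in clusters.items():
        if word in gold_tags:  # words without a gold tag are ignored
            tags_by_cluster[cluster_id].append(gold_tags[word])

    total = sum(len(tags) for tags in tags_by_cluster.values())
    if total == 0:
        return 0.0
    # For each cluster, count the words carrying its most frequent tag.
    agree = sum(Counter(tags).most_common(1)[0][1]
                for tags in tags_by_cluster.values())
    return agree / total


# Toy data invented for illustration; these are not real LOB tag assignments.
clusters = {"the": 0, "a": 0, "dog": 1, "cat": 1, "run": 1, "walk": 2}
gold_tags = {"the": "DET", "a": "DET", "dog": "NOUN",
             "cat": "NOUN", "run": "VERB", "walk": "VERB"}
purity = cluster_purity(clusters, gold_tags)
print(f"purity = {purity:.3f}")
```

Running the same function over the clusters produced by each candidate
similarity measure gives one number per measure to compare. Purity alone
rewards many tiny clusters, so in practice the number of clusters should be
held fixed or reported alongside it.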

More recently, Leeds PhD student Andy Roberts has used this
word-clustering evaluation technique, comparison with LOB corpus
tagging, to evaluate a different word-clustering approach based on
function-word-collocation profile patterns; see:

Roberts, Andrew. 2002. Automatic Acquisition of Word Classification using
Distributional Analysis of Content Words with Respect to Function Words.
Unpublished Research Report, School of Computing, University of Leeds
http://www.comp.leeds.ac.uk/andyr/research/abstracts/roberts01autoacquire.html
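
To give a feel for what a function-word collocation profile might look
like, here is a small Python sketch. It is only my guess at the general
idea and not the method of the report above: the tiny FUNCTION_WORDS list,
the one-word window on each side, and the use of cosine similarity are all
assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny function-word list for the example.
FUNCTION_WORDS = {"the", "a", "to", "of", "is"}


def profiles(tokens):
    """Map each content word to counts of adjacent function words.

    Keys are "L:<word>" for a function word immediately to the left and
    "R:<word>" for one immediately to the right.
    """
    prof = {}
    for i, tok in enumerate(tokens):
        if tok in FUNCTION_WORDS:
            continue
        counts = prof.setdefault(tok, Counter())
        if i > 0 and tokens[i - 1] in FUNCTION_WORDS:
            counts["L:" + tokens[i - 1]] += 1
        if i + 1 < len(tokens) and tokens[i + 1] in FUNCTION_WORDS:
            counts["R:" + tokens[i + 1]] += 1
    return prof


def cosine(p, q):
    """Cosine similarity between two sparse count profiles."""
    dot = sum(p[k] * q[k] for k in p if k in q)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0


# Toy token sequence invented for the example.
tokens = "the dog is to run the cat is to walk".split()
prof = profiles(tokens)
print(cosine(prof["dog"], prof["cat"]), cosine(prof["dog"], prof["run"]))
```

In this toy input "dog" and "cat" have identical profiles while "dog" and
"run" share no context, which is the kind of separation a clustering step
would then exploit.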


Eric Atwell, School of Computing, Leeds University


On Fri, 8 Oct 2004, Normand Peladeau wrote:

> I have been reviewing some of the similarity measures used to perform word
> clustering (Jaccard, Dice, Simple Matching, correlation, etc.) and I came to
> the conclusion that many of those measures have metric problems that
> probably make them non-optimal for word clustering.
>
> I am working now on some modified versions of those indices and I need some
> ways to benchmark those new similarity measures.  I would like to have a
> series of benchmarks for several kinds of application (dimension reduction,
> automatic identification of themes, automatic taxonomy development, etc.).
>
> I would like suggestions for ways to benchmark those new measures and compare
> their performance with the more traditional ones.  Any idea, reference, or
> data set would be welcome.
>
> I am also looking for existing articles where those measures have been
> compared (either empirically or theoretically).
>
>
> Thanks,
>
> Normand Peladeau
> Provalis Research

--
Eric Atwell, Senior Lecturer, Computer Vision and Language research group,
School of Computing, University of Leeds, LEEDS LS2 9JT, England
TEL: +44-113-2335430  FAX: +44-113-2335468  http://www.comp.leeds.ac.uk/eric


