Corpora: Summary: Measures for similarity between two sentences

Constantin Orasan in6093 at wlv.ac.uk
Mon Nov 20 15:43:46 UTC 2000


Last week, I posted a message enquiring about measures of similarity
between two sentences. I would like to thank:
- Christopher Brewster
- Miles Osborne
- Jennifer Spenader
- Alexander Gelbukh
- Kevin McTait
- Patrick Ruch
- Ken Litkowski
- Barb Ball
- Manuel Montes
- Bill Fisher
- Andreas Faatz
for their answers and suggestions. Given that a few people expressed
interest in this topic, here is a summary:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the PhD thesis "Collocational similarity : emergent patterns in
lexical environments" by Paul Richard Hays, 1997, Birmingham, KWIC lines
are compared. Maybe it can be adapted for comparing sentences.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

String edit distance can be used as a measure (sentences A and B are
equally similar to C if A and B can be mapped to C using the same number
of edits), but one could easily imagine another set of editing
operations. The right choice of measure depends very much on the
application for which it is used.
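
As a rough illustration of the idea, here is a minimal Python sketch of
a word-level edit distance. It is not taken from any of the replies: the
function name edit_distance, the cost parameters and the example
sentences are all my own assumptions. Changing the costs is one simple
way of getting "another set of editing operations".

```python
def edit_distance(a, b, ins_cost=1, del_cost=1, sub_cost=1):
    """Weighted Levenshtein distance between two sequences (hypothetical helper)."""
    # prev[j] = cost of turning the items of `a` seen so far into the first j items of `b`
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * del_cost]  # turning i items of `a` into nothing: i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + del_cost,                         # delete x
                curr[j - 1] + ins_cost,                     # insert y
                prev[j - 1] + (0 if x == y else sub_cost),  # keep or substitute
            ))
        prev = curr
    return prev[-1]


# Made-up example sentences, split into words
sent_a = "the cat sat on the mat".split()
sent_b = "the cat sat on a rug".split()
sent_c = "the cat sat on a mat".split()

print(edit_distance(sent_a, sent_c))  # 1 (the -> a)
print(edit_distance(sent_b, sent_c))  # 1 (rug -> mat)
```

Under this measure A and B are equally similar to C, since each needs
exactly one edit. Whether that matches intuition depends on the
application, e.g. a translation memory might want to penalise a changed
noun more than a changed article.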

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Papers which could help are:
- Shieber, Stuart (1993). The Problem of Logical-Form Equivalence,
Computational Linguistics, Vol 19, No. 1
- Spenader, Jennifer (2000). Defining Propositional Similarity:
Systemizing Identification of Presuppositional Binding. Proceedings of
Götalog 2000, Fourth Workshop on the Semantics and Pragmatics of
Dialogue, Göteborg University 15-17 June 2000.
- Emmanuel Planas, MT Summit VII: 'Formalizing Translation Memory'
- Manuel Montes-y-Gómez, Alexander Gelbukh, Aurelio López-López.
Comparison of Conceptual Graphs. Proc. MICAI-2000, 1st Mexican
International Conference on Artificial Intelligence, Acapulco, Mexico,
April 2000. In: O. Cairo, L.E. Sucar, F.J. Cantu (eds.) MICAI 2000:
Advances in Artificial Intelligence. Lecture Notes in Artificial
Intelligence N 1793, ISSN 0302-9743, ISBN
3-540-67354-7, Springer, pp. 548-556
- Kenneth C. Litkowski, 1999, Towards a Meaning-Full Comparison of
Lexical Resources, Proceeding of the Association for Computational
Linguistics Special Interest Group on the Lexicon, June 21-22, College
Park, MD
- Andreas Faatz, Designing clustering methods for ontology building: The
Mo K workbench

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A distance metric can be useful on different levels, and it can be
applied to any material (tokens, part-of-speech, word-sense). A good
introduction, theoretical, practical and didactic, can be found at:
http://www-igm.univ-mlv.fr/~lecroq/seqcomp/index.html

Some (Unix-like) C code can be downloaded here:
http://odur.let.rug.nl/~kleiweg/levenshtein/
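
To show what comparing on different levels might look like, here is a
small Python sketch. Note that it does not use Levenshtein distance: it
uses difflib.SequenceMatcher from the Python standard library, which
gives a similarity ratio based on matching blocks. The part-of-speech
tags are assigned by hand for the example, and the helper name
similarity is my own.

```python
from difflib import SequenceMatcher


def similarity(seq_a, seq_b):
    """Similarity ratio in [0, 1] between two sequences of hashable items."""
    return SequenceMatcher(None, seq_a, seq_b).ratio()


# Token level: two made-up sentences that share few words
tokens_a = "the cat sat on the mat".split()
tokens_b = "a dog lay on the rug".split()

# Part-of-speech level: hand-assigned tags for the same two sentences
pos_a = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]
pos_b = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]

print(similarity(tokens_a, tokens_b))  # low: only "on the" is shared
print(similarity(pos_a, pos_b))        # 1.0: identical tag sequences
```

The two sentences look quite different at the token level but identical
at the part-of-speech level, so the level chosen changes what "similar"
means.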

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ThemeScape software might be useful. It scans entire documents in search
of similarity.  They're at www.cartia.com.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can download from the NIST site
(http://www.nist.gov/speech/tools/index.htm) some software called
"aldistsm-1.2.tar.Z" which computes an alignment (edit) distance between
two sentences, where the basic editing operations are changes in
phonological features, including splits and merges on the word level.

==============================
Constantin Orasan
Computational Linguistics Group
University of Wolverhampton
http://www.wlv.ac.uk/~in6093


