Corpora: Summary: Measures for similarity between two sentences

Tom Vanallemeersch Tom.Vanallemeersch at lant.be
Mon Nov 20 17:24:12 UTC 2000


Sorry for this late reply (I just got back from vacation). I developed
functionality in Emacs for comparing two sentences, more specifically
translations. It detects common strings in two sentences and small
variants (spelling variants etc.). The common parts are visualized using
colors, and two scores are computed, a similarity score and a score
expressing the effort needed to modify the first sentence into the
second one.

In the picture below are a few sentences from the CRATER corpus, i.e. two
raw sentences and the corresponding tagged sentences. Each example is
delimited by a dashed line. Common strings are highlighted in grey, small
differences are underlined, and common strings that occur in a different
order in the two sentences start with a green block. Below the compared
sentences is a line with the similarity score and the effort score. The
effort score is calculated from the number of deletions, insertions, and
common parts that occur in a different order in the two sentences. The
higher the effort score, the more effort is needed. The similarity score
depends on the effort score and the length of the sentences. A similarity
score of 1 indicates that the sentences are identical. In the case of the
tagged sentences, the tags are treated as part of the text (i.e. not
recognized as tags). A higher correspondence between tags will therefore
produce a higher similarity score.
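
For readers who want to experiment with scores of this kind, here is a
rough Python sketch. It is not the Emacs code and does not reproduce its
exact formulas, which are not given here. The function name, the
word-level comparison, the unit costs, and the normalisation by the
combined sentence length are all assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def effort_and_similarity(first, second):
    """Return (effort, similarity) for two sentences.

    Hypothetical scoring, chosen only for illustration:
      - every deleted or inserted word costs 1;
      - a word that is deleted in one place and inserted in another is
        treated as a moved common part and costs 1 instead of 2;
      - similarity = 1 - effort / (len(first) + len(second)) in words.
    """
    a, b = first.split(), second.split()
    deleted, inserted = [], []

    # Opcodes describe how to turn the word list a into the word list b.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted.extend(a[i1:i2])
        if tag in ("insert", "replace"):
            inserted.extend(b[j1:j2])

    # Words present on both sides of the edit script count as reordered.
    moved = sum((Counter(deleted) & Counter(inserted)).values())
    effort = len(deleted) + len(inserted) - moved

    total = len(a) + len(b)
    similarity = 1.0 if total == 0 else 1.0 - effort / total
    return effort, similarity


if __name__ == "__main__":
    print(effort_and_similarity("the cat sat", "the dog sat"))
    print(effort_and_similarity("a b c", "c a b"))
```

Under these assumptions identical sentences get an effort of 0 and a
similarity of 1, and the similarity drops towards 0 as more words have to
be deleted, inserted, or moved. Because tags would simply be extra words
in the input, matching tags raise the similarity here as well.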

Hope this helps,

Tom.

--
LANT nv/sa, Research Park Haasrode, Interleuvenlaan 21, B-3001 Leuven
mailto:Tom.Vanallemeersch at lant.be               Phone: ++32 16 405140
http://www.lant.be/                             Fax: ++32 16 404961



[From Admin Corpora list:

Picture at:  http://www.hit.uib.no/corpora/compsent.gif      ]


