Corpora: approximations (bounds) for edit distance

Computer Researcher compresearch at hotmail.com
Sat Dec 1 18:33:25 UTC 2001


Hi, thanks for the responses.

I'm looking for tighter bounds than 0 and the length of the string, by using
some preprocessing on the strings. So we compute some info in the initial
time, and we use this later.

Assuming that cost of insert, delete, and update are all 1; we can find a
lower-bound as follows:

-----
Let X=(x_1, x_2, x_3, .., x_n) and Y=(y_1, y_2, y_3, .., y_n) be the number
of occurrences of the letters in string A, and B, respectively. We define a
distance between these two vectors X,Y as follows:

if (x_i > y_i) P_i = x_i - y_i
else           N_i = y_i - x_i

and distance(X,Y) = Max(summation of all P_is, summation of all N_is) is the
lower bound for the edit distance(A,B).

This can be proved. Actually this distance is also a metric, and it is a
better bound than 0.
-----

I know a paper (published in 1988) that shows this bound. I had given the
reference in my previous message.

The question is can we come up with a better lower bound? And can we come up
an upper bound (better than the length of the distance)? Note that, the
reason we can find such bounds is that we use some preprocessing on the
strings for later usage.
OR can we come up with an approximation like this (not necessarily a bound
but some distance that is a good approximation to edit distance).

Thanks again for your attention..

CR



>From: Robert Luk (COMP staff) <csrluk at comp.polyu.edu.hk>
>To: CORPORA at HD.UIB.NO, lambertb at uic.edu
>Subject: Re: Corpora: approximations (bounds) for edit distance
>Date: Sat, 1 Dec 2001 07:22:23 +0800 (HKT)
>
>Hi Bruce,
>
> > Maybe I'm missing something, but the upper bound on edit distance
>between
> > two strings is always the length of the longer string, and the lower
>bound
> > is always zero (when the strings are identical).
>
>I guess it depends on the edit cost of the operations.
>For most applications, we do expect all edit costs to
>be larger than or equals to zero and the lb edit
>distance between any 2 strings is 0. However, the ub
>of any 2 strings is the length of the longer string
>is true (I guess) if insert, delete and substitution
>edit costs are 1 and 0 for correct substitution.
>If I remember correctly, that's the Levenstein
>metric but one needs to check the details for
>correctness. In the applications that I have worked,
>it is not uncommon to use edit costs are than
>the Levenstein.
>
>Cheers,
>
>Robert Luk
>


_________________________________________________________________
Get your FREE download of MSN Explorer at http://explorer.msn.com/intl.asp



More information about the Corpora mailing list