Corpora: approximations (bounds) for edit distance

Bruce L. Lambert lambertb at uic.edu
Sat Dec 1 04:01:08 UTC 2001


> > Maybe I'm missing something, but the upper bound on edit distance between
> > two strings is always the length of the longer string, and the lower bound
> > is always zero (when the strings are identical).
>
>I guess it depends on the edit cost of the operations.
>For most applications, we do expect all edit costs to
>be larger than or equals to zero and the lb edit
>distance between any 2 strings is 0. However, the ub
>of any 2 strings is the length of the longer string
>is true (I guess) if insert, delete and substitution
>edit costs are 1 and 0 for correct substitution.
>If I remember correctly, that's the Levenstein
>metric but one needs to check the details for
>correctness. In the applications that I have worked,
>it is not uncommon to use edit costs are than
>the Levenstein.

Hi Robert,

I agree. I was describing the simplest possible edit distance algorithm,
with mismatch costs equal to 1 and correct substitutions equal to 0. Edit
distance gets interesting and powerful when one uses a matrix of
substitution costs, one cost for each possible
substitution/insertion/deletion in an alphabet, and when the costs reflect
some inverse function of similarity or confusability "in the real world".

The real problem with edit distance is that is slow. What approaches people
are using to speed up edit distance calculations? One heuristic approach
that has been tried is to first use a fast approach like ngram similarity,
then only compute edit distance on pairs whose ngram similarities exceed
some threshold. Are there other methods?

-bruce



More information about the Corpora mailing list