Corpora: approximations (bounds) for edit distance

Robert Luk (COMP staff) csrluk at comp.polyu.edu.hk
Sat Dec 1 10:48:49 UTC 2001


Hi Bruce,

> I agree. I was describing the simplest possible edit distance algorithm,
> with mismatch costs equal to 1 and correct substitutions equal to 0. Edit
> distance gets interesting and powerful when one uses a matrix of
> substitution costs, one cost for each possible
> substitution/insertion/deletion in an alphabet, and when the costs reflect
> some inverse function of similarity or confusability "in the real world".
>
> The real problem with edit distance is that is slow. What approaches people
> are using to speed up edit distance calculations? One heuristic approach
> that has been tried is to first use a fast approach like ngram similarity,
> then only compute edit distance on pairs whose ngram similarities exceed
> some threshold. Are there other methods?

There were some research in fast Viterbi, A* algorithms, Dijsktra in the speech
processing community (around early 1990s in IEEE ICASSP), which they
are relevant. Some kind of prunning would help if string lengths are
very long. However, my knowledge is limited to here. My experience with approximate
matching words is not that slow. If you want to do approximate match of a set
of strings with 1 string, obviously a different algo should be used.

Best,

Robert Luk



More information about the Corpora mailing list