Corpora: Summary: Fast edit distance algorithms
Mark Lewellen
mlewellen at apptek.com
Thu Jan 13 22:09:04 UTC 2000
Hello:
Following is a summary of responses to my query about edit
distance algorithms. I sincerely apologize for the delayed
response. The original query was:
> Can anyone point me to relatively recent work (roughly
> the last five years) on fast methods of computing edit distance?
> I'm interested in algorithms that compare pairs of individual words
> (not algorithms that search long text strings for approximate matches
> of words). Opinions on fastest methods welcome also. I'll post a
> summary of responses to the list.
Many thanks to those who responded:
Kemal Oflazer
Bruce Lambert
Martin Kay
Gerardo Sierra
Moshe Lewenstein
Paul Bowden
Gertjan van Noord
George Demetriou
Overall summary: A variety of very useful advice on string
matching was sent. Interestingly, though, none of the responses
pointed to recent research that focused on speeding up edit-
distance algorithms. Some related research was by K. Oflazar on
finite-state methods and Amir et. al. on swaps. Two books in
the area were written in 1994 by Stephen and Aoe. Some
people wrote that edit distance algorithms are simply too
slow, and suggested alternative algorithms. It makes one
wonder if there is any on-going research out there on speeding
up edit-distance algorithms! Since edit-distance
algorithms use dynamic programming, one suggestion involves
a fast dynamic-programming method. Finally, a code snippet
was generously provided.
Mark
Details:
----------------------------------------------------
References that were suggested are:
Articles
1) Oflazer, Kemal, 1996. "Error-tolerant Finite-state Recognition with
Applications to Morphological Analysis and Spelling Correction".
Computational Linguistics 22(1):73-89.
Available at http://www.cs.bilkent.edu.tr/~ko/ko.html.
This paper describes a method for a very fast way of finding strings
that are very close to some string in a regular language described by
a finite state machine.
2) Amir, Amihood, Yonatan Aumann, Gad M. Landau, Moshe Lewenstein
and Noa Lewenstein, 1997. "Pattern Matching with Swaps". Proceedings,
38th Annual IEEE Conference on Foundations of Computer Science (FOCS),
Miami Beach, Florida, pp. 148-153.
The algorithms I have worked on considered swaps as an isolated
error. In other words I considered the problem of string matching
with swaps, i.e. given a pattern P finds all locations where P
appears in a text T even if swaps have occured in T.
In theoretical measures the best known algorithms (that I know of)
for edit distance with swaps still runs in O(nm) time where n is the
text size and m the pattern size. There is an algorithm of Landau
and Vishkin that runs in O(nk) time, where k bounds the number of
errors.
Our original algorithm for string matching with swaps (not edit
distance) runs in O(nm^{1/3} \log m) and you are welcome to take
it off my homepage at http://www.cs.biu.ac.il/~moshe
I assume that this is not the problem you are really interested in.
However, if you are, we will soon have a new paper improving on
the previous results.
[The website contains other interesting articles about pattern-matching
in text, as well.]
Books
1) Stephen, Graham A., 1994. String Searching Algorithms. World Scientific
Publishing, Singapore.
See http://www1.fatbrain.com/asp/bookinfo/bookinfo.asp?theisbn=9810237030
2) Aoe, Jun-Ichi, 1994. Computer Algorithms: String Pattern-Matching
Strategies. IEEE Computer Society.
See
http://www.amazon.com/exec/obidos/ASIN/0818654627/o/qid=947796598/sr=8-3/102
-8070569-5159269
Projects
UMIST has just developed an algorithm to identify semantic clusters
through the alignment of definitions using edit distance. It compares
two definitions and matches the pair of words that can be substituted
in a particular context without changing the meaning. The clusters are
used to expand a query in a dictionary we also designed for searching a
suitable term whose concept the user knows but which he does not remember
how to designate correctly.
http://www.mcsierra.freeserve.co.uk/penpict.html [may be an old link]
---------------------------------
Other approaches/references:
(1) Edit distance is inherently (algorithmically) complex, so it will never
be
really FAST. I've optimized my edit distance code (in Lisp) as much as I
can, and it's still not fast. The best thing to do if you really need speed
is NOT to use edit distance, but use ngram methods instead. They can be
made really fast with special indexing.
(2) I use something called the Ratcliff-Obershelp algorithm, which returns a
percentage similarity between any two words.
Unfortunately, I can't give you a reference to it, as I learned of it many
years ago in "Computing", a British trade newspaper. But I coded it up in
'C' and it works a treat! (Perhaps someone could give ME a reference to
this algorithm?) [from Paul Bowden, pmr at doc.ntu.ac.uk]
(3) You may want to have a look at agrep's algorithm for approximate
pattern matching. It is a Boyer-Moore type of algorithm with some
changes made to it to speed up search considerably according to its
developers (e-mail sw at cs.arizona.edu or udi at cs.arizona.edu for more
details).
You may also find it useful to look at DP-like algorithms in the
speech recognition area (alignment of phonetic sequences with phonetic
dictionary entries) as well as the bionformatics area (especially if
you need to apply efficient searching for common subsequences).
--------------------------------------------------
Dynamic Programming
Surely this is as near as you can get to a classic case of dynamic
programming and it would therefore be quite surprising of you could do
better than Floyd's algorithm. Depth-first search is pessimal, but caching
makes it near to optimal!
---------------------------------------------------
Code
And finally, some helpful code from Gertjan van Noord:
below is a C function which calculates Levenshtein distance for
two given strings. I don't know whether this is fast enough; it is
about ten times faster than the equivalent Perl version.
#define MAXLINE 100
int lev(char a[],char b[]) {
int arr[MAXLINE][MAXLINE];
int i,j,l,m,n,add;
for (i=0;i<=strlen(a);i++) {
arr[0][i]=i;
}
for (j=0;j<=strlen(b);j++) {
arr[j][0]=j;
}
for (j=1;j<=strlen(b);j++) {
for (i=1;i<=strlen(a);i++) {
if (a[i-1] == b[j-1])
{ add=0; } else { add=1; }
m = 1+arr[j-1][i];
l = 1+arr[j][i-1];
n = add+arr[j-1][i-1];
arr[j][i] = (m < l ? (m < n ? m : n): (l < n ? l : n));
}
}
return arr[strlen(b)][strlen(a)];
}
More information about the Corpora
mailing list