Corpora: approximations (bounds) for edit distance

Mon Dec 3 04:13:35 UTC 2001

Two comments:
1. An excellent reference from the bioinformatics people is:

Algorithms on Strings, Trees, and Sequences: Computer Science and
Computational Biology by Dan Gusfield

2. Derrick Higgins and I have used a bit-map approach to speed up
morpheme comparison: in practice, you use a 32, 48, or 64 bit field, and
set the first bit to 1 iff there is an a in the word, the second bit to
1 if there is a b in the word, etc. (Size of alphabet depends on your
language.) You then just perform a bit-wise comparison of two words (two
computer words, not linguistic words!). Bit-wise comparison is extremely
fast of course, and you can determine that there is insufficient overlap
between two morphemes very quickly in the vast majority of cases. Then,
if you do find a pair of morphemes whose letters overlap enough to merit
checking them in the usual linear-programming fashion, you can do so --
but the vast majority of the hopeless pairs have been eliminated in a
much more rapid manner.

John A. Goldsmith
Department of Linguistics, The University of Chicago
ja-goldsmith at uchicago.edu
http://humanities.uchicago.edu/faculty/goldsmith

>My other guess is that people working on DNA sequence matching might
have
>worked on this problem. That's all I could contribute  - sorry.

>Best,

>Robert Luk