Corpora: approximations (bounds) for edit distance

Robert Luk (COMP staff) csrluk at comp.polyu.edu.hk
Tue Dec 4 03:20:55 UTC 2001


Hi CR,

> I'd like to give one more clarification:
>
> The preprocessing time is not important for me. Preprocess time can be very
> high, but the amount of information I gather from preprocess should be
> limited (space). In other words, I want to preprocess the data and keep some
> limited information (such as frequencies), and use this information (with
> less space) for edit distance approximations (bounds).

> In the example I gave before (keeping the frequencies), instead of keeping
> the whole strings, we just keep the frequencies and compute a lower bound to
> the edit distance using the frequency information. For example, in a DNA
> sequence, there is only four letters and the frequency vector will be only 4
> dimensions. Instead of the whole strings (which could be large), we just
> keep 4 integers. What I'm looking for is to do more preprocessing and find a
> better bound.

My guess is that it might be possible precompile the (var or fixed) context windows
if you are using
the same string for approx. matching. The single-char case is a special
case with a window of 1 character. This sounds like an
interesting research problem if there are no body who worked on it. Is
there any interest to collaborate? Are you working for some company or
are you a faculty/research student of some University?

Best Regards,

Robert Luk
PS: Viterbi is 1 DP algo. There are other DP algos for more general cases.

> >and sum them up as some kind of bounds. However, this requires O(|x| |y|)
> >cost in general.
> >Another less expensive one is to use a fixed size window (say within 4
> >characters)
> >to look for these edit distance bounds and the time complexity is O(|x|),
> >which is
> >the same if you compute the character occurrences. You can also increase
> >the
> >window size to get better and better estimates of the bounds.
>
>
> _________________________________________________________________
> Get your FREE download of MSN Explorer at http://explorer.msn.com/intl.asp
>
>



More information about the Corpora mailing list