Re: Corpora: edit distance and spell checking

Kristina Kjellson kristina.kjellson at nst.as
Mon Dec 3 10:05:41 UTC 2001


Has anyone successfully used the Perl module String::Approx to spell-check a
corpus? Or does anyone have another suggestion? We want to generate a lexicon
from the corpus, but because of the topic it contains many frequently
recurring spelling mistakes.
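
One possible approach, sketched here in Python rather than Perl and not based on String::Approx: treat frequent word forms as trusted lexicon entries and map each rare form to the most frequent trusted form within a small edit distance. The function names and the `min_count` and `max_dist` thresholds are illustrative assumptions, and the pairwise comparison is only practical for small vocabularies.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def suggest_corrections(tokens, min_count=5, max_dist=1):
    """Map each rare word form to the most frequent trusted form nearby."""
    counts = Counter(tokens)
    trusted = [w for w, c in counts.items() if c >= min_count]
    suggestions = {}
    for word, count in counts.items():
        if count >= min_count:
            continue  # frequent forms are assumed to be spelled correctly
        candidates = [t for t in trusted
                      if edit_distance(word, t) <= max_dist]
        if candidates:
            suggestions[word] = max(candidates, key=counts.__getitem__)
    return suggestions
```

Rare forms with no trusted neighbour are left out of the result, so they can be reviewed by hand instead of being merged automatically.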

/Kristina Kjellson
Language engineer
Nordisk språkteknologi, Norway





-----Original Message-----
From: Bruce L. Lambert, Ph.D. [mailto:lambertb at uic.edu]
Sent: 30 November 2001 19:43
To: CORPORA at HD.UIB.NO
Subject: Re: Corpora: approximations (bounds) for edit distance


Maybe I'm missing something, but the upper bound on edit distance between 
two strings is always the length of the longer string, and the lower bound 
is always zero (when the strings are identical).
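
These bounds can be checked with a short Python sketch, assuming unit-cost Levenshtein distance (insert, delete, substitute). It also uses the difference in string lengths as a slightly tighter lower bound, since each edit changes the length by at most one. The function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def trivial_bounds(a, b):
    """Return (lower, upper) bounds that need only the string lengths."""
    lower = abs(len(a) - len(b))   # each edit changes the length by at most 1
    upper = max(len(a), len(b))    # substitute along the shorter, pad the rest
    return lower, upper
```

For any pair of strings, `edit_distance(a, b)` should fall between the two values returned by `trivial_bounds(a, b)`.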

-bruce


At 06:43 PM 11/29/01 +0000, Computer Researcher wrote:
>Hi,
>
>Does anyone know of good approximations (lower and/or upper bounds) to edit 
>distance, computed from statistics that can be obtained by preprocessing 
>the strings?
>
>At preprocessing time we can transform each string into a set of numbers 
>(e.g., a multi-dimensional vector), and then use these vectors to 
>approximate the edit distance between strings.
>
>I found a paper by Hadlock, F. (1988) proposing a "lower bound" based on 
>the frequencies of the letters in each string. Assuming that the alphabet 
>is the same for all strings, all frequency vectors have the same number of 
>dimensions. He defines a distance metric over these vectors such that 
>the distance in the vector space is a lower bound on the actual edit 
>distance.
>
>Do you know any other method that can achieve a similar goal?
>
>Thanks for your attention,
>
>CR
>
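
The letter-frequency idea in the quoted message can be illustrated with a Python sketch. This is one simple frequency-based lower bound, not necessarily the exact metric defined in the cited paper: each unit-cost edit removes at most one surplus letter and supplies at most one missing letter, so the larger of the two totals cannot exceed the edit distance. The function names are illustrative.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def frequency_lower_bound(a, b):
    """Lower bound on edit distance from letter counts alone."""
    ca, cb = Counter(a), Counter(b)
    surplus = sum((ca - cb).values())   # letters a has in excess of b
    missing = sum((cb - ca).values())   # letters b has in excess of a
    return max(surplus, missing)
```

The counts can be computed once per string at preprocessing time. The bound ignores letter order, so it is 0 for anagrams such as "abc" and "cba" even though their edit distance is 2.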
