<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">

<HTML>

<HEAD>

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">

<META NAME="Generator" CONTENT="MS Exchange Server version 5.5.2653.12">

<TITLE>SV: Corpora: edit distance and spell checking</TITLE>

</HEAD>

<BODY>


<P><FONT SIZE=2>Is there anyone who has tried the perl package string::approx with success when trying to spell check a corpus? Or does anyone have another suggestion? Our aim is to try to generate a lexicon from the corpus but because of the topic, there are lots of frequent spelling mistakes.</FONT></P>


<P><FONT SIZE=2>/Kristina Kjellson</FONT>

<BR><FONT SIZE=2>Language engineer</FONT>

<BR><FONT SIZE=2>Nordisk språkteknologi, Norway</FONT>

</P>

<BR>

<BR>

<BR>

<BR>


<P><FONT SIZE=2>-----Ursprungligt meddelande-----</FONT>

<BR><FONT SIZE=2>Från: Bruce L. Lambert, Ph.D. [<A HREF="mailto:lambertb@uic.edu">mailto:lambertb@uic.edu</A>]</FONT>

<BR><FONT SIZE=2>Skickat: den 30 november 2001 19:43</FONT>

<BR><FONT SIZE=2>Till: CORPORA@HD.UIB.NO</FONT>

<BR><FONT SIZE=2>Ämne: Re: Corpora: approximations (bounds) for edit distance</FONT>

</P>

<BR>


<P><FONT SIZE=2>Maybe I'm missing something, but the upper bound on edit distance between </FONT>

<BR><FONT SIZE=2>two strings is always the length of the longer string, and the lower bound </FONT>

<BR><FONT SIZE=2>is always zero (when the strings are identical).</FONT>

</P>


<P><FONT SIZE=2>-bruce</FONT>

</P>

<BR>


<P><FONT SIZE=2>At 06:43 PM 11/29/01 +0000, Computer Researcher wrote:</FONT>

<BR><FONT SIZE=2>>Hi,</FONT>

<BR><FONT SIZE=2>></FONT>

<BR><FONT SIZE=2>>Does anyone know good approximations (lower and/or upper bounds) to edit </FONT>

<BR><FONT SIZE=2>>distance? (by using some statistical numbers that can be found by </FONT>

<BR><FONT SIZE=2>>preprocessing of the strings)</FONT>

<BR><FONT SIZE=2>></FONT>

<BR><FONT SIZE=2>>In the preprocess time we can transform the strings to a bunch of numbers </FONT>

<BR><FONT SIZE=2>>(e.g., multi-dimensional vectors); and then use these vectors to </FONT>

<BR><FONT SIZE=2>>approximate the edit distance between strings.</FONT>

<BR><FONT SIZE=2>></FONT>

<BR><FONT SIZE=2>>I found a paper by Hadlock, F. (1988), proposing a "lower bound" by using </FONT>

<BR><FONT SIZE=2>>frequencies of the letters in the string. Assuming that the alphabet is </FONT>

<BR><FONT SIZE=2>>same for all strings, all frequency vectors will have same number of </FONT>

<BR><FONT SIZE=2>>dimensions. And he defines a distance metric over these vectors, so that </FONT>

<BR><FONT SIZE=2>>this distance (in the vector space) is a lower-bound to the actual edit </FONT>

<BR><FONT SIZE=2>>distance.</FONT>

<BR><FONT SIZE=2>></FONT>

<BR><FONT SIZE=2>>Do you know any other method that can achieve a similar goal?</FONT>

<BR><FONT SIZE=2>></FONT>

<BR><FONT SIZE=2>>Thanks for your attention,</FONT>

<BR><FONT SIZE=2>></FONT>

<BR><FONT SIZE=2>>CR</FONT>

<BR><FONT SIZE=2>></FONT>

<BR><FONT SIZE=2>>_________________________________________________________________</FONT>

<BR><FONT SIZE=2>>Get your FREE download of MSN Explorer at <A HREF="http://explorer.msn.com/intl.asp" TARGET="_blank">http://explorer.msn.com/intl.asp</A></FONT>

<BR><FONT SIZE=2>></FONT>

</P>


</BODY>

</HTML>