Corpora: edit distance and spell checking

Mon Dec 3 10:57:21 UTC 2001

Hi,

We are studying an improved batch spell checker that uses the linguistic context; some results have been published. For the documents we are working on, a preliminary named entity recognizer turned out to be necessary, but we have not reached a conclusion yet.

-Patrick
  ----- Original Message ----- 
  From: Kristina Kjellson 
  To: CORPORA at HD.UIB.NO 
  Sent: Monday, December 03, 2001 11:05 AM
  Subject: SV: Corpora: edit distance and spell checking

  Has anyone used the Perl package String::Approx successfully to spell check a corpus? Or does anyone have another suggestion? Our aim is to generate a lexicon from the corpus, but because of the topic it contains many frequent spelling mistakes.
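
One way to approach the task described above is to treat frequent corpus tokens as the provisional lexicon and map each rare token to its closest frequent neighbour. The following is only a minimal sketch in Python, not String::Approx: it uses the standard-library `difflib.get_close_matches`, whose similarity ratio is related to, but not the same as, edit distance. The function name `suggest_corrections` and the thresholds `min_freq` and `cutoff` are assumptions chosen for illustration.

```python
from collections import Counter
from difflib import get_close_matches


def suggest_corrections(tokens, min_freq=3, cutoff=0.8):
    """Map each rare token to its most similar frequent token, if any.

    tokens   -- iterable of word tokens from the corpus
    min_freq -- tokens seen at least this often count as lexicon entries
                (hypothetical threshold, tune per corpus)
    cutoff   -- minimum difflib similarity ratio (0..1) for a suggestion
    """
    counts = Counter(tokens)
    # Frequent tokens form the provisional lexicon.
    frequent = [word for word, count in counts.items() if count >= min_freq]

    suggestions = {}
    for word, count in counts.items():
        if count >= min_freq:
            continue  # already treated as a correct form
        match = get_close_matches(word, frequent, n=1, cutoff=cutoff)
        if match:
            suggestions[word] = match[0]
    return suggestions


if __name__ == "__main__":
    sample = ["lexicon"] * 5 + ["corpus"] * 4 + ["lexcion", "corpsu", "zebra"]
    for wrong, right in sorted(suggest_corrections(sample).items()):
        print(f"{wrong} -> {right}")
```

Note that a misspelling that is itself frequent would end up in the provisional lexicon under this scheme, so the frequency threshold alone is unlikely to be sufficient when mistakes are common.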

  /Kristina Kjellson 
  Language engineer 
  Nordisk språkteknologi, Norway 

  -----Original Message----- 
  From: Bruce L. Lambert, Ph.D. [mailto:lambertb at uic.edu] 
  Sent: 30 November 2001 19:43 
  To: CORPORA at HD.UIB.NO 
  Subject: Re: Corpora: approximations (bounds) for edit distance 

  Maybe I'm missing something, but the upper bound on edit distance between 
  two strings is always the length of the longer string, and the lower bound 
  is always zero (when the strings are identical). 

  -bruce 
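
The trivial bounds mentioned above can be checked against an actual computation. The sketch below assumes unit-cost Levenshtein distance (insertion, deletion, substitution); the function names are illustrative. It also uses the length difference as the lower bound, which is as cheap to compute as zero and never smaller.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance computed with a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def trivial_bounds(a: str, b: str) -> tuple[int, int]:
    """Return (lower, upper) bounds that need only the string lengths."""
    lower = abs(len(a) - len(b))  # at least this many insertions/deletions
    upper = max(len(a), len(b))   # substitute everything, then pad
    return lower, upper


if __name__ == "__main__":
    for x, y in [("kitten", "sitting"), ("corpus", "corpus"), ("", "abc")]:
        lo, hi = trivial_bounds(x, y)
        d = levenshtein(x, y)
        print(f"{x!r} vs {y!r}: {lo} <= {d} <= {hi}")
```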

  At 06:43 PM 11/29/01 +0000, Computer Researcher wrote: 
  >Hi, 
  > 
  >Does anyone know good approximations (lower and/or upper bounds) to edit 
  >distance, computed from statistics that can be obtained by 
  >preprocessing the strings? 
  > 
  >At preprocessing time we can transform the strings into sets of numbers 
  >(e.g., multi-dimensional vectors), and then use these vectors to 
  >approximate the edit distance between strings. 
  > 
  >I found a paper by Hadlock, F. (1988), proposing a "lower bound" that uses 
  >the frequencies of the letters in the string. Assuming that the alphabet is 
  >the same for all strings, all frequency vectors will have the same number of 
  >dimensions. He defines a distance metric over these vectors such that 
  >this distance (in the vector space) is a lower bound on the actual edit 
  >distance. 
  > 
  >Do you know any other method that can achieve a similar goal? 
  > 
  >Thanks for your attention, 
  > 
  >CR 
  > 
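
To make the frequency-vector idea in the quoted question concrete, here is a minimal sketch of one well-known letter-count lower bound. It is not claimed to be the metric Hadlock defines; it is the simple bound obtained by comparing letter counts, assuming unit-cost Levenshtein distance. Every single edit reduces the surplus of letters on either side by at most one, so the larger surplus cannot exceed the edit distance. The function names are illustrative.

```python
from collections import Counter


def frequency_lower_bound(a: str, b: str) -> int:
    """Lower bound on unit-cost edit distance from letter counts alone."""
    ca, cb = Counter(a), Counter(b)
    surplus_a = sum((ca - cb).values())  # letters a has in excess of b
    surplus_b = sum((cb - ca).values())  # letters b has in excess of a
    return max(surplus_a, surplus_b)


def levenshtein(a: str, b: str) -> int:
    """Reference unit-cost edit distance, used only to check the bound."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    for x, y in [("kitten", "sitting"), ("abc", "cba"), ("lexicon", "lexcion")]:
        print(x, y, frequency_lower_bound(x, y), levenshtein(x, y))
```

The bound ignores letter order, so it is zero for any pair of anagrams; it is mainly useful as a cheap filter before running the full dynamic program.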
