Corpora: edit distance and spell checking

Patrick Ruch patrick.ruch at dim.hcuge.ch
Mon Dec 3 10:57:21 UTC 2001


Hi,

We are studying the use of an improved batch spell checker that exploits the linguistic context, and some results have been published. Given the documents we are working on, a preliminary named entity recognizer proved necessary, but we have not reached a conclusion yet.

-Patrick
  ----- Original Message ----- 
  From: Kristina Kjellson 
  To: CORPORA at HD.UIB.NO 
  Sent: Monday, December 03, 2001 11:05 AM
  Subject: SV: Corpora: edit distance and spell checking


  Has anyone had success using the Perl package String::Approx to spell check a corpus? Or does anyone have another suggestion? Our aim is to generate a lexicon from the corpus, but because of the topic there are many frequent spelling mistakes.

  /Kristina Kjellson 
  Language engineer 
  Nordisk språkteknologi, Norway 






  -----Original message-----
  From: Bruce L. Lambert, Ph.D. [mailto:lambertb at uic.edu]
  Sent: 30 November 2001 19:43
  To: CORPORA at HD.UIB.NO
  Subject: Re: Corpora: approximations (bounds) for edit distance



  Maybe I'm missing something, but the upper bound on edit distance between 
  two strings is always the length of the longer string, and the lower bound 
  is always zero (when the strings are identical). 

  -bruce 
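
The trivial bounds mentioned in the reply above can be checked with a short Python sketch. The function names `levenshtein` and `trivial_bounds` are illustrative and not taken from any library, and unit costs for insertion, deletion, and substitution are assumed. The sketch also returns the length difference as a lower bound; that is an addition to the message, and it holds because each edit changes the length of a string by at most one.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def trivial_bounds(a: str, b: str) -> tuple[int, int]:
    """Return (lower, upper) bounds that need only the string lengths."""
    lower = abs(len(a) - len(b))  # at least this many inserts or deletes
    upper = max(len(a), len(b))   # substitute the overlap, insert the rest
    return lower, upper
```

For example, `levenshtein("kitten", "sitting")` is 3, which lies within the bounds `(1, 7)` returned by `trivial_bounds("kitten", "sitting")`.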



  At 06:43 PM 11/29/01 +0000, Computer Researcher wrote: 
  >Hi, 
  > 
  >Does anyone know good approximations (lower and/or upper bounds) to edit
  >distance, computed from statistics that can be obtained by
  >preprocessing the strings?
  >
  >At preprocessing time we can transform the strings into sets of numbers
  >(e.g., multi-dimensional vectors) and then use these vectors to
  >approximate the edit distance between strings.
  >
  >I found a paper by Hadlock, F. (1988) proposing a "lower bound" based on
  >the frequencies of the letters in each string. Assuming that the alphabet
  >is the same for all strings, all frequency vectors will have the same
  >number of dimensions. He defines a distance metric over these vectors such
  >that this distance (in the vector space) is a lower bound on the actual
  >edit distance.
  > 
  >Do you know any other method that can achieve a similar goal? 
  > 
  >Thanks for your attention, 
  > 
  >CR 
  > 
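
One way to read the letter-frequency idea described in the question above is the counting bound sketched below in Python. This is an assumption about the general approach, not necessarily the exact metric defined in Hadlock's paper, and the function name `frequency_lower_bound` is hypothetical. The reasoning is that a substitution can remove one surplus letter on each side at once, while an insertion or deletion removes a surplus letter on one side only, so the larger of the two surpluses cannot exceed the unit-cost edit distance.

```python
from collections import Counter


def frequency_lower_bound(a: str, b: str) -> int:
    """Lower bound on unit-cost edit distance from letter counts alone."""
    fa, fb = Counter(a), Counter(b)
    # Counter subtraction keeps only positive differences.
    surplus_a = sum((fa - fb).values())  # letters a has in excess of b
    surplus_b = sum((fb - fa).values())  # letters b has in excess of a
    return max(surplus_a, surplus_b)
```

The count vectors can be computed once per string at preprocessing time. The bound ignores letter order, so it can be loose: for "abc" and "cba" it gives 0 although the strings differ.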




