Corpora: edit distance and spell checking
Patrick Ruch
patrick.ruch at dim.hcuge.ch
Mon Dec 3 10:57:21 UTC 2001
Hi,
We are studying an improved batch spell checker that uses the linguistic context; some results have been published. Given the document we are working on, a preliminary named entity recognizer was necessary, but we have not reached a conclusion yet.
-Patrick
----- Original Message -----
From: Kristina Kjellson
To: CORPORA at HD.UIB.NO
Sent: Monday, December 03, 2001 11:05 AM
Subject: SV: Corpora: edit distance and spell checking
Has anyone used the Perl package String::Approx successfully to spell check a corpus? Or does anyone have another suggestion? Our aim is to generate a lexicon from the corpus, but because of the topic it contains many frequent spelling mistakes.
/Kristina Kjellson
Language engineer
Nordisk språkteknologi, Norway
-----Original Message-----
From: Bruce L. Lambert, Ph.D. [mailto:lambertb at uic.edu]
Sent: 30 November 2001 19:43
To: CORPORA at HD.UIB.NO
Subject: Re: Corpora: approximations (bounds) for edit distance
Maybe I'm missing something, but the upper bound on edit distance between
two strings is always the length of the longer string, and the lower bound
is always zero (when the strings are identical).
-bruce
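
The trivial bounds mentioned above can be checked with a short sketch. It assumes the unit-cost (Levenshtein) edit distance, which the thread does not specify; the function names are hypothetical. Under that assumption the lower bound can also be tightened from zero to the difference in string lengths, since each edit changes the length by at most one.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute); an assumed cost model."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def trivial_bounds(a: str, b: str) -> tuple[int, int]:
    """(lower, upper) bounds on the edit distance from string lengths alone."""
    return abs(len(a) - len(b)), max(len(a), len(b))


lower, upper = trivial_bounds("kitten", "sitting")
assert lower <= levenshtein("kitten", "sitting") <= upper
```

These bounds cost nothing to compute but are usually too loose to prune many candidates, which is presumably why the original question asks for something sharper.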
At 06:43 PM 11/29/01 +0000, Computer Researcher wrote:
>Hi,
>
>Does anyone know of good approximations (lower and/or upper bounds) to edit
>distance, computed from statistics that can be obtained by preprocessing
>the strings?
>
>At preprocessing time we can transform the strings into a set of numbers
>(e.g., multi-dimensional vectors), and then use these vectors to
>approximate the edit distance between strings.
>
>I found a paper by Hadlock, F. (1988), proposing a "lower bound" that uses
>the frequencies of the letters in the string. Assuming that the alphabet is
>the same for all strings, all frequency vectors will have the same number of
>dimensions. He defines a distance metric over these vectors such that
>this distance (in the vector space) is a lower bound on the actual edit
>distance.
>
>Do you know any other method that can achieve a similar goal?
>
>Thanks for your attention,
>
>CR
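
A letter-frequency lower bound of the kind described in the question can be sketched as follows. This is one well-known variant for unit-cost edit distance and is not necessarily the exact metric defined in Hadlock's paper; the function name is hypothetical. The reasoning: an insertion or deletion changes one letter count by one, and a substitution lowers one count and raises another, so a single edit can remove at most one surplus letter from each side.

```python
from collections import Counter


def frequency_lower_bound(a: str, b: str) -> int:
    """Lower bound on unit-cost edit distance from letter counts alone.

    Assumed variant, not taken from the cited paper.
    """
    count_a, count_b = Counter(a), Counter(b)
    # Counter subtraction keeps only positive differences.
    surplus_in_a = sum((count_a - count_b).values())  # letters a has in excess
    surplus_in_b = sum((count_b - count_a).values())  # letters b has in excess
    return max(surplus_in_a, surplus_in_b)


print(frequency_lower_bound("kitten", "sitting"))  # bound for a classic pair
print(frequency_lower_bound("abc", "cba"))         # anagrams give a bound of 0
```

The counts can be computed once per string at preprocessing time, so comparing two strings costs time proportional to the alphabet size rather than to the product of the string lengths. The bound ignores letter order, so it is loose for anagrams and similar pairs.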