<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.6000.16414" name=GENERATOR></HEAD>
<BODY>
<DIV dir=ltr align=left><SPAN class=652061804-14032007><FONT face=Arial
color=#0000ff size=2>Further to Adriano's request below, is anyone aware of
sentence tokenizers/splitters that have been trained on or applied to email
data? </FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=652061804-14032007><FONT face=Arial
color=#0000ff size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=652061804-14032007><FONT face=Arial
color=#0000ff size=2>Some of the noise in email text will be similar to that of
web text (emoticons, typos, etc.), but there are also email-specific phenomena
(greetings, signatures, quoted material, etc.) that seem to
require techniques tailored to email.</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=652061804-14032007><FONT face=Arial
color=#0000ff size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=652061804-14032007><FONT face=Arial
color=#0000ff size=2>I await your summary of responses with interest,
Adriano.</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=652061804-14032007><FONT face=Arial
color=#0000ff size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=652061804-14032007><FONT face=Arial
color=#0000ff size=2>Are there any additional pointers that people can
offer, specifically with regard to processing email text?
</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=652061804-14032007><FONT face=Arial
color=#0000ff size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=652061804-14032007><FONT face=Arial
color=#0000ff size=2>Thanks,</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=652061804-14032007><FONT face=Arial
color=#0000ff size=2>Andrew Lampert</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=652061804-14032007><!-- Converted from text/plain format -->
<P><FONT size=2>--------------<BR>Andrew Lampert<BR>Research
Engineer<BR>Information Engineering Laboratory<BR>CSIRO ICT Centre<BR>&lt;<A
href="http://www.ict.csiro.au/staff/Andrew.Lampert/">http://www.ict.csiro.au/staff/Andrew.Lampert/</A>&gt;<BR><BR>Post:
Locked Bag 17, North Ryde, NSW 1670, Australia<BR>Office: Building E6B,
Macquarie University, North Ryde, 2113<BR>Tel: +61 2 9325 3129, Fax: +61 2 9325
3200<BR> </FONT> </P></SPAN></DIV><BR>
<DIV class=OutlookMessageHeader lang=en-us dir=ltr align=left>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> owner-corpora@lists.uib.no
[mailto:owner-corpora@lists.uib.no] <B>On Behalf Of </B>Adriano
Ferraresi<BR><B>Sent:</B> Tuesday, 13 March 2007 10:40 PM<BR><B>To:</B>
CORPORA@UIB.NO<BR><B>Subject:</B> [Corpora-List] Tokenizer for English Web
Corpus<BR></FONT><BR></DIV>
<DIV>Hi everybody,</DIV>
<DIV> </DIV>
<DIV>I am embarking on a research project that aims to build a large
corpus of English by automatically crawling the web. For this purpose I would
welcome suggestions for an efficient tokenizer for
English. It should take into account specific aspects of Web
writing (such as emoticons, typos, commonly used abbreviations,
etc.). Does anyone know of such a tool? </DIV>
<DIV> </DIV>
<DIV>I will post a summary of the answers I (hopefully!) receive.</DIV>
<DIV> </DIV>
<DIV>Thank you.</DIV>
<DIV> </DIV>
<DIV>Adriano Ferraresi</DIV></BODY></HTML>