<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<HTML><HEAD>

<META http-equiv=Content-Type content="text/html; charset=us-ascii">

<META content="MSHTML 6.00.6000.16414" name=GENERATOR></HEAD>

<BODY>

<DIV dir=ltr align=left><SPAN class=652061804-14032007><FONT face=Arial 

color=#0000ff size=2>Further to Adriano's request below, is anyone aware of 

sentence tokenizers/splitters that have been trained on or applied to email 

data? </FONT></SPAN></DIV>

<DIV dir=ltr align=left><SPAN class=652061804-14032007><FONT face=Arial 

color=#0000ff size=2></FONT></SPAN> </DIV>

<DIV dir=ltr align=left><SPAN class=652061804-14032007><FONT face=Arial 

color=#0000ff size=2>Some of the noise in email text will be similar to that of 

web text (emoticons, typos, etc.), but there are also email-specific phenomena 

(greetings, email signatures, quoted material, etc.) that seem to 

require techniques tailored to email.</FONT></SPAN></DIV>

<DIV dir=ltr align=left><SPAN class=652061804-14032007><FONT face=Arial 

color=#0000ff size=2></FONT></SPAN> </DIV>

<DIV dir=ltr align=left><SPAN class=652061804-14032007><FONT face=Arial 

color=#0000ff size=2>I await your summary of responses with interest, 

Adriano.</FONT></SPAN></DIV>

<DIV dir=ltr align=left><SPAN class=652061804-14032007><FONT face=Arial 

color=#0000ff size=2></FONT></SPAN> </DIV>

<DIV dir=ltr align=left><SPAN class=652061804-14032007><FONT face=Arial 

color=#0000ff size=2>Are there any additional pointers that people can 

offer, specifically with regard to processing email text? 

</FONT></SPAN></DIV>

<DIV dir=ltr align=left><SPAN class=652061804-14032007><FONT face=Arial 

color=#0000ff size=2></FONT></SPAN> </DIV>

<DIV dir=ltr align=left><SPAN class=652061804-14032007><FONT face=Arial 

color=#0000ff size=2>Thanks,</FONT></SPAN></DIV>

<DIV dir=ltr align=left><SPAN class=652061804-14032007><FONT face=Arial 

color=#0000ff size=2>Andrew Lampert</FONT></SPAN></DIV>

<DIV dir=ltr align=left><SPAN class=652061804-14032007>

<P><FONT size=2>--------------<BR>Andrew Lampert<BR>Research 

Engineer<BR>Information Engineering Laboratory<BR>CSIRO ICT Centre<BR>&lt;<A 

href="http://www.ict.csiro.au/staff/Andrew.Lampert/">http://www.ict.csiro.au/staff/Andrew.Lampert/</A>&gt;<BR><BR>Post: 

Locked Bag 17, North Ryde, NSW 1670, Australia<BR>Office: Building E6B, 

Macquarie University, North Ryde, 2113<BR>Tel: +61 2 9325 3129, Fax: +61 2 9325 

3200<BR> </FONT> </P></SPAN></DIV><BR>

<DIV class=OutlookMessageHeader lang=en-us dir=ltr align=left>

<HR tabIndex=-1>

<FONT face=Tahoma size=2><B>From:</B> owner-corpora@lists.uib.no 

[mailto:owner-corpora@lists.uib.no] <B>On Behalf Of </B>Adriano 

Ferraresi<BR><B>Sent:</B> Tuesday, 13 March 2007 10:40 PM<BR><B>To:</B> 

CORPORA@UIB.NO<BR><B>Subject:</B> [Corpora-List] Tokenizer for English Web 

Corpus<BR></FONT><BR></DIV>

<DIV></DIV>

<DIV>Hi everybody,</DIV>

<DIV> </DIV>

<DIV>I am embarking on a research project that aims to build a large 

corpus of English from automatic crawls of the web. For this purpose I would 

welcome suggestions for an efficient tokenizer for English, ideally one that 

takes into account specific aspects of web writing (such as emoticons, typos, 

commonly used abbreviations, etc.). Does anyone know of such a tool? </DIV>

<DIV> </DIV>

<DIV>I will post a summary of the answers I (hopefully!) receive.</DIV>

<DIV> </DIV>

<DIV>Thank you.</DIV>

<DIV> </DIV>

<DIV>Adriano Ferraresi</DIV></BODY></HTML>