<div>Hi everybody,</div>
<div> </div>
<div>I am currently embarking on a research project aimed at building a large corpus of English through automatic crawls of the web. For this purpose I would appreciate suggestions for an efficient tokenizer for English. The tokenizer should take into account specific features of web writing, such as the treatment of emoticons, typos, and commonly used abbreviations. Does anyone know of such a tool?
</div>
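<div>To give a concrete idea of the behaviour I have in mind, here is a minimal, purely illustrative sketch in Python: emoticon and abbreviation patterns are tried before the generic word and punctuation patterns, so they survive as single tokens. The pattern lists are just examples, not an exhaustive inventory.</div>
<div> </div>

```python
import re

# Illustrative pattern lists (not exhaustive): emoticons and common
# abbreviations are matched before generic words and punctuation,
# so they are kept as single tokens.
EMOTICON = r"[:;=8][\-o\*']?[\)\]\(\[dDpP/\\|@3]"
ABBREV = r"(?:e\.g\.|i\.e\.|etc\.|Mr\.|Mrs\.|Dr\.|vs\.)"
WORD = r"\w+(?:[-']\w+)*"
PUNCT = r"[^\w\s]"

# Alternation order matters: earlier alternatives win at each position.
TOKEN_RE = re.compile("|".join([EMOTICON, ABBREV, WORD, PUNCT]), re.UNICODE)

def tokenize(text):
    """Return a list of tokens, keeping emoticons and abbreviations whole."""
    return TOKEN_RE.findall(text)
```

<div> </div>
<div>For example, tokenize("great job :-) e.g. typos") yields ['great', 'job', ':-)', 'e.g.', 'typos'] rather than splitting the emoticon and the abbreviation into separate punctuation tokens.</div>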
<div> </div>
<div>I will post a summary of the answers I (hopefully!) receive.</div>
<div> </div>
<div>Thank you.</div>
<div> </div>
<div>Adriano Ferraresi</div>