<br>Joerg,<br><div class="gmail_quote"><br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

By the way, are there freely available test sets for evaluating tokenization and sentence boundary detection? I would like to check performance for several languages and various domains.<br>

<br></blockquote><div><br>Since you're interested in various domains, try the GENIA corpus for biomedical text (specifically, abstracts of biological journal articles). It works well as a test set for both tokenization and sentence boundary detection.<br>


<br>Kev<br><br>-- <br></div></div>Kevin Bretonnel Cohen, PhD<br>Biomedical Text Mining Group Lead, Center for Computational Pharmacology, U. Colorado School of Medicine<br>and<br>Lead Artificial Intelligence Engineer, The MITRE Corporation, Human Language Technology Division<br>


303-916-2417 (cell) 303-377-9194 (home)<br><a href="http://compbio.ucdenver.edu/Hunter_lab/Cohen">http://compbio.ucdenver.edu/Hunter_lab/Cohen</a><br><br>