[Corpora-List] summary: tokenization & sentence boundary detection

Kevin B. Cohen kevin.cohen at gmail.com
Thu Aug 12 15:11:31 UTC 2010


Joerg,

> By the way, are there freely available test sets for evaluating tokenization
> and sentence boundary detection? I would like to check performance for
> several languages and various domains.
>
>
Since you're interested in various domains, try the GENIA corpus for
biomedical text--biological journal article abstracts, specifically.  It's
good both for tokenization and for sentence boundaries.

Kev

-- 
Kevin Bretonnel Cohen, PhD
Biomedical Text Mining Group Lead, Center for Computational Pharmacology, U.
Colorado School of Medicine
and
Lead Artificial Intelligence Engineer, The MITRE Corporation, Human Language
Technology Division
303-916-2417 (cell) 303-377-9194 (home)
http://compbio.ucdenver.edu/Hunter_lab/Cohen
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

