[Corpora-List] Sentence boundary detection

Kevin B. Cohen kevin.cohen at gmail.com
Tue Jul 24 15:39:34 UTC 2007


On 7/24/07, Andy Roberts <andyr at comp.leeds.ac.uk> wrote:
> It's not been under any major evaluation by myself, but my jTokeniser
> Java library has a sentence segmentation module. I'm utilising Java's
> built-in text processing libraries (which were donated by IBM's ICU4J
> project) to do all the hard work.

We had good luck with Andy's jTokeniser in a corpus refactoring
project recently.  The inputs were biomedical texts, which present
some unique weirdness, and it performed well.  I don't have
quantitative data.  We *do* have some quantitative data on the
performance of the LingPipe sentence splitter, and it performs very
nicely in head-to-head comparisons with other systems.
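
For anyone who wants to try the built-in route Andy describes, here is a
minimal sketch of sentence segmentation with the standard library alone.
It assumes that the "built-in text processing libraries" he refers to
include java.text.BreakIterator; it does not show jTokeniser's own API,
and the class and method names (SentenceSplitSketch, splitSentences) are
made up for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitSketch {

    // Hypothetical helper: returns the sentences that the JDK's
    // locale-aware sentence BreakIterator finds in the given text.
    public static List<String> splitSentences(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getSentenceInstance(locale);
        boundaries.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = boundaries.first();
        // Walk consecutive boundary offsets; each [start, end) span is one sentence.
        for (int end = boundaries.next();
             end != BreakIterator.DONE;
             start = end, end = boundaries.next()) {
            // The span includes trailing whitespace, so trim it off.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "The inputs were biomedical texts. It performed well.";
        for (String sentence : splitSentences(text, Locale.US)) {
            System.out.println(sentence);
        }
    }
}
```

The default rules are generic, so abbreviations, gene names, and other
biomedical oddities may well need extra handling on top of this.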

Kev

-- 
K. B. Cohen
Biomedical Text Mining Group Lead
Center for Computational Pharmacology
303-724-7563 (office) 303-916-2417 (cell) 303-377-9194 (home)
http://compbio.uchsc.edu/Hunter_lab/Cohen
