[Corpora-List] Sentence boundary detection
Kevin B. Cohen
kevin.cohen at gmail.com
Tue Jul 24 15:39:34 UTC 2007
On 7/24/07, Andy Roberts <andyr at comp.leeds.ac.uk> wrote:
> It's not been under any major evaluation by myself, but my jTokeniser
> Java library has a sentence segmentation module. I'm utilising Java's
> built-in text processing libraries (which were donated by IBM's ICU4J
> project) to do all the hard work.
We had good luck with Andy's jTokeniser in a corpus refactoring
project recently. The inputs were biomedical texts, which present
some unique weirdness, and it performed well. I don't have
quantitative data. We *do* have some quantitative data on the
performance of the LingPipe sentence splitter, and it performs very
nicely in head-to-head comparisons with other systems.
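For anyone who wants to try the built-in Java route Andy mentions, here
is a minimal sketch. It assumes the relevant class is
java.text.BreakIterator (my guess at what "Java's built-in text
processing libraries" refers to); it is not jTokeniser's actual code,
and the SentenceSplitter class and split method names are made up for
illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of jTokeniser or LingPipe.
class SentenceSplitter {

    // Splits text into sentences using the JDK's locale-aware sentence iterator.
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Each call to next() returns the offset of the following boundary,
        // or BreakIterator.DONE when the text is exhausted.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Segments include trailing whitespace, so trim before storing.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "The protein binds DNA. It is expressed in liver cells. Is this expected?";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

Abbreviations, gene names and citations in biomedical text are the
kind of input where a rule-based splitter like this is likely to need
checking against your own data.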
Kev
--
K. B. Cohen
Biomedical Text Mining Group Lead
Center for Computational Pharmacology
303-724-7563 (office) 303-916-2417 (cell) 303-377-9194 (home)
http://compbio.uchsc.edu/Hunter_lab/Cohen
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora