[Corpora-List] Sentence segmenting

Florian Leitner fleitner at cnio.es
Tue Aug 14 08:21:11 UTC 2012


As nobody has mentioned this option so far: Apache OpenNLP has a sentence splitter, too, and it comes with a wrapper for the UIMA framework in case you are using that. It is MEMM-based and requires no prior (and, if I may say so, "subjective") tokenization. It is fairly easy to use from within GATE as well, although you have to make some minor modifications to the GATE wrapper if you want to use the latest version. The Markov model implementation is extremely efficient, and pre-trained models are available for English, German, Dutch, Portuguese, and Swedish if you have no domain-specific training data. Last but not least, it is released under the very permissive Apache 2 license, is supported as a full-fledged Apache project (i.e., not incubating), and generally fits very nicely into the entire Apache data mining "ecosystem".
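
To make the "no prior tokenization" point concrete, here is a toy Python sketch of the general idea: scan the raw character stream for candidate end-of-sentence marks and decide each one from its surrounding context. This is not OpenNLP code and not its API. The function names, the feature set, and the small abbreviation list are all hypothetical, and the hand-written rules in `is_boundary` merely stand in for the decision a trained statistical model would make.

```python
import re

# Hypothetical abbreviation list; a trained model would learn such cues from data.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof", "st"}


def candidate_features(text, i):
    """Describe the raw-character context around the candidate mark at index i."""
    # Characters immediately before the mark, back to the previous whitespace.
    prefix = re.search(r"(\S*)\Z", text[:i]).group(1)
    rest = text[i + 1:]
    next_char = rest.lstrip()[:1]
    return {
        "prefix": prefix.lower(),
        "followed_by_space": rest == "" or rest[0].isspace(),
        "next_is_upper": next_char == "" or next_char.isupper(),
    }


def is_boundary(features):
    """Stand-in for the classifier: fixed rules, NOT a trained model."""
    if not features["followed_by_space"]:
        return False  # e.g. the dot inside "3.5" or "p.m"
    if features["prefix"] in ABBREVIATIONS:
        return False  # e.g. honorifics such as "Dr." or "Mrs."
    return features["next_is_upper"]


def split_sentences(text):
    """Split raw text at every candidate mark that is judged to be a boundary."""
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch in ".!?" and is_boundary(candidate_features(text, i)):
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = "Dr. Smith arrived at 3.5 p.m. today. He left early! Did Mrs. Jones notice?"
    for sentence in split_sentences(sample):
        print(sentence)
```

The point of the sketch is only the control flow: candidates are found directly in the untokenized text, so no tokenizer's decisions are baked into the result. A real system would replace the rules with a model trained on annotated sentence boundaries.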

-Florian


On 13 Aug 2012, at 15:35, Jeff Elmore wrote:

> I'm curious what folks are using these days for sentence segmenting for English.
> 
> My application involves narrative and informational texts at a variety of reading levels and genres. Most text is hand-edited to eliminate non-prose content but any system that could respond robustly to unedited text would be awesome, of course.
> 
> Mostly we've been using hand-crafted tools written in Python. I have checked out what NLTK offers but from what I've seen there's not anything terribly accurate in it (fails on obvious common cases like some honorifics). We did develop a decision tree based model using Weka for Spanish text. I'd be happy to do this again for English but wanted to see if there's something good already out there.
> 
> Thanks in advance!
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

-- 
Florian Leitner, PhD <fleitner.cnio at gmail.com>

Structural Biology and BioComputing Programme
Spanish National Cancer Research Centre (CNIO)

Address: C/ Melchor Fernandez Almagro 3; E-28029 Madrid
Phone: +34 91 732 8000
Fax: +34 91 224 6980
Internet: http://www.cnio.es


