[Corpora-List] Sentence segmenting

Diana Maynard d.maynard at dcs.shef.ac.uk
Mon Aug 13 14:23:41 UTC 2012


Hi Jeff,
The sentence splitter in GATE is pretty accurate, especially for 
English. You can easily improve it for any language by adding your own 
abbreviation list or editing the existing one. The issues that usually 
foil it are related to line breaks in less formal kinds of documents, 
such as forum messages (but there are a couple of alternative versions 
of the splitter for just such an eventuality).
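
To illustrate the abbreviation-list idea in general terms, here is a minimal
Python sketch. It is not GATE's splitter or its API: the names
ABBREVIATIONS, BOUNDARY and split_sentences, and the contents of the list,
are hypothetical and chosen only for this example. It assumes that a
sentence ends at ".", "!" or "?" followed by whitespace, unless the token
before a "." is a listed abbreviation, and that line breaks inside a
paragraph are just wrapping.

```python
import re

# Hypothetical abbreviation list (lowercase, without the final period).
# Extend or replace it for your own language or genre.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof", "st", "e.g", "i.e", "vs"}

# Candidate boundary: ., ! or ? plus any closing quotes/brackets, then whitespace.
BOUNDARY = re.compile(r"""([.!?]["')\]]*)\s+""")


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text into sentences, skipping boundaries after known abbreviations."""
    # Assumption: line breaks are hard wraps, so collapse all whitespace runs.
    text = re.sub(r"\s+", " ", text).strip()
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.end(1)]
        words = candidate.split()
        last = words[-1] if words else ""
        # Drop surrounding quotes/brackets before checking the token.
        token = last.rstrip("\"')]").lstrip("(\"'[")
        if token.endswith(".") and token[:-1].lower() in abbreviations:
            continue  # period belongs to an abbreviation, not a sentence end
        sentences.append(candidate)
        start = match.end()
    remainder = text[start:]
    if remainder:
        sentences.append(remainder)
    return sentences


if __name__ == "__main__":
    sample = ("Dr. Smith met Mrs. Jones\nin Sheffield. "
              "They talked, e.g. about corpora! Was it useful?")
    for sentence in split_sentences(sample):
        print(sentence)
```

Passing a larger set, for instance ABBREVIATIONS | {"fig"}, is the
equivalent of editing the abbreviation list. Note the limits of the sketch:
an abbreviation that really does end a sentence will not be split, and
because line breaks are collapsed it will not help with forum-style text
where a line break is the only sentence boundary.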
Diana

On 13/08/12 14:35, Jeff Elmore wrote:
> I'm curious what folks are using these days for sentence segmenting for
> English.
>
> My application involves narrative and informational texts at a variety
> of reading levels and genres. Most text is hand-edited to eliminate
> non-prose content but any system that could respond robustly to unedited
> text would be awesome, of course.
>
> Mostly we've been using hand-crafted tools written in Python. I have
> checked out what NLTK offers but from what I've seen there's not
> anything terribly accurate in it (fails on obvious common cases like
> some honorifics). We did develop a decision-tree-based model using Weka
> for Spanish text. I'd be happy to do this again for English but wanted
> to see if there's something good already out there.
>
> Thanks in advance!
