[Corpora-List] Sentence segmenting

Thanh-Le Ha leht82 at gmail.com
Tue Aug 14 08:35:58 UTC 2012


Hi Jeff,

I also tried NLTK's pre-trained sentence segmenter before and was not
satisfied with the quality either. I turned to Splitta
(http://code.google.com/p/splitta/), mentioned by Aleksandar above, and it's
really good for English. It hasn't been trained on other languages, but
for your requirements I think Splitta is worth trying.

--Le.

On Tue, Aug 14, 2012 at 10:17 AM, Steven Bird <sb at csse.unimelb.edu.au> wrote:

> On 13 August 2012 23:35, Jeff Elmore <jelmore at lexile.com> wrote:
> > I have checked out what NLTK offers but from what I've seen there's
> > not anything terribly accurate in it (fails on obvious common cases
> > like some honorifics).
>
> Note that NLTK just uses Punkt, and this won't necessarily perform
> well if it uses an off-the-shelf model that was trained on data that
> contained different abbreviations to the test data:
>
> "Punkt is designed to learn parameters (a list of abbreviations, etc.)
> unsupervised from a corpus similar to the target domain. The
> pre-packaged models may therefore be unsuitable: use
> PunktSentenceTokenizer(text) to learn parameters from the given text."
> http://nltk.org/api/nltk.tokenize.html
>
> -Steven Bird
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
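
For reference, Steven's suggestion of learning the Punkt parameters from
text in the target domain can be sketched roughly as below. This is only a
minimal sketch: the function names and the sample sentence are made up for
illustration, and the second function (seeding the abbreviation list by
hand through PunktParameters) is an alternative I am assuming is acceptable
when no suitable training text is at hand, not something from the quoted
documentation.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer


def train_on_domain_text(domain_text):
    """Learn Punkt parameters (abbreviations etc.) unsupervised from raw text."""
    # Passing a string to the constructor trains the model on that text,
    # as described in the NLTK documentation quoted above.
    return PunktSentenceTokenizer(domain_text)


def with_known_abbreviations(abbreviations):
    """Skip training and seed the abbreviation list by hand (illustrative alternative)."""
    params = PunktParameters()
    # Punkt stores abbreviation types lowercased and without the final period.
    params.abbrev_types.update(a.lower().rstrip(".") for a in abbreviations)
    # A PunktParameters object is used as-is instead of being trained on.
    return PunktSentenceTokenizer(params)


# Hypothetical sample with the kind of honorifics Jeff mentioned.
sample = "Dr. Smith met Mr. Jones. They talked."
seeded = with_known_abbreviations(["Dr.", "Mr."])
sentences = seeded.tokenize(sample)
```

With the honorifics seeded this way, "Dr." and "Mr." should no longer be
treated as sentence ends. How well train_on_domain_text does depends on how
much text you give it and how close that text is to the data you want to
segment.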