[Corpora-List] Sentence segmenting

Steven Bird sb at csse.unimelb.edu.au
Tue Aug 14 08:17:47 UTC 2012


On 13 August 2012 23:35, Jeff Elmore <jelmore at lexile.com> wrote:
> I have checked
> out what NLTK offers but from what I've seen there's not anything terribly
> accurate in it (fails on obvious common cases like some honorifics).

Note that NLTK just uses Punkt, which won't necessarily perform well
when the off-the-shelf model was trained on data whose abbreviations
differ from those in the test data:

"Punkt is designed to learn parameters (a list of abbreviations, etc.)
unsupervised from a corpus similar to the target domain. The
pre-packaged models may therefore be unsuitable: use
PunktSentenceTokenizer(text) to learn parameters from the given text."
http://nltk.org/api/nltk.tokenize.html
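
To make the quoted advice concrete, here is a minimal sketch of training
Punkt on text from the target domain instead of relying on a
pre-packaged model. The sample strings and variable names are
hypothetical, and a real run would need far more training text than
this; what Punkt actually learns (and so where it splits) depends
entirely on the corpus it is given.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Hypothetical stand-in for a corpus from the target domain. In practice
# this would be a large body of raw text, not a handful of sentences.
domain_text = (
    "Dr. Adams met Mr. Brown on Monday. Dr. Adams brought the report. "
    "Mr. Brown read it twice. Later Dr. Adams and Mr. Brown met Prof. Chen. "
    "Prof. Chen asked Dr. Adams about the data. Mr. Brown took notes."
)

# Passing raw text to the constructor trains the model on that text
# (unsupervised), rather than loading a pre-trained parameter set.
tokenizer = PunktSentenceTokenizer(domain_text)

# Hypothetical test input from the same domain.
test_text = "Dr. Adams called Prof. Chen. They spoke for an hour."

# tokenize() returns a list of sentence strings.
sentences = tokenizer.tokenize(test_text)

for sentence in sentences:
    print(sentence)
```

Whether honorifics such as "Dr." are handled correctly here depends on
how often they occur in the training text, so it is worth checking the
output on a held-out sample before trusting the trained model.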

-Steven Bird
