[Corpora-List] Sentence Splitter tool

Bill_Lang(Gmail) billlangjun at gmail.com
Mon Oct 29 11:06:47 UTC 2007


 


Hi Naveed,

NLTK provides a class named PunktSentenceTokenizer for sentence
splitting. Its documentation describes it as follows:


Class PunktSentenceTokenizer


A sentence tokenizer which uses an unsupervised algorithm to build a model
for abbreviation words, collocations, and words that start sentences; and
then uses that model to find sentence boundaries. This approach has been
shown to work well for many European languages.

Here is some demo code in Python:

----------------------------------------------------------------------------

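import nltk.data

# Load the pre-trained English Punkt model (the NLTK "punkt" data package
# must be installed first, e.g. via nltk.download('punkt')).
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Read the whole input file and close it when done.
with open("test.txt") as fp:
    data = fp.read()

# Print the sentences, separated from each other by a line of dashes.
print('\n-----\n'.join(tokenizer.tokenize(data)))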

----------------------------------------------------------------------------
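
If no pre-trained model suits your text, the unsupervised step from the class description can be tried directly. What follows is only a rough sketch, assuming NLTK is installed: the training text and the test sentence are made-up examples of mine, and with so little training data the learned abbreviations may well be poor. In practice you would train on a large amount of raw text in the target language.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Hypothetical raw training text (an assumption for illustration only);
# a real model should be trained on a large, unannotated corpus.
training_text = (
    "Dr. Jones met Mr. Brown on Monday. Dr. Jones had written to Mr. Brown "
    "before. Mr. Brown replied to Dr. Jones at once. They agreed to meet "
    "again. The meeting was short. Dr. Jones left early."
)

# Learn abbreviations, collocations and sentence starters from the raw text.
trainer = PunktTrainer()
trainer.train(training_text, finalize=True)
params = trainer.get_params()

# Inspect which word types the trainer decided to treat as abbreviations.
print(sorted(params.abbrev_types))

# Build a tokenizer from the learned parameters and split some new text.
tokenizer = PunktSentenceTokenizer(params)
text = "Mr. Brown called Dr. Jones. She did not answer."
sentences = tokenizer.tokenize(text)
for sentence in sentences:
    print(sentence)
```

No annotated sentence boundaries are needed for this; the trainer only sees raw text, which is why the same approach can be applied to other languages.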

 

  _____  

From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
Afzal, Naveed
Sent: Monday, October 29, 2007 5:48 PM
To: corpora at uib.no
Subject: [Corpora-List] Sentence Splitter tool

 

I am looking for a sentence splitter tool. Can anyone help me out
with this?

 

Thanks,

Naveed

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

