[Corpora-List] Sentence Splitter tool
Bill_Lang(Gmail)
billlangjun at gmail.com
Mon Oct 29 11:06:47 UTC 2007
Hi Naveed,
NLTK provides a class named PunktSentenceTokenizer for sentence
splitting. Its documentation describes it as follows:
Class PunktSentenceTokenizer
A sentence tokenizer which uses an unsupervised algorithm to build a model
for abbreviation words, collocations, and words that start sentences; and
then uses that model to find sentence boundaries. This approach has been
shown to work well for many European languages.
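To make the two-stage idea in that description concrete (first learn a model from raw text, then use it to place boundaries), here is a toy sketch using only the standard library. It is not the Punkt algorithm: the "followed by a lowercase word" heuristic, the function names, and the min_count threshold are all assumptions made up for illustration.

```python
import re
from collections import Counter

def learn_abbreviations(raw_text, min_count=2):
    """Toy 'training' step: treat a token ending in '.' as an abbreviation
    if it is followed by a lowercase word at least min_count times."""
    tokens = raw_text.split()
    counts = Counter()
    for current, following in zip(tokens, tokens[1:]):
        if current.endswith(".") and following[:1].islower():
            counts[current.lower()] += 1
    return {token for token, n in counts.items() if n >= min_count}

def split_sentences(text, abbreviations):
    """Toy 'boundary' step: split after . ! or ? followed by whitespace,
    unless the token before the break is a learned abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.start() + 1]
        if candidate.split()[-1].lower() in abbreviations:
            continue  # e.g. "approx." does not end the sentence
        sentences.append(candidate)
        start = match.end()
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

# Hypothetical unannotated training text and input text.
training_text = ("The trip took approx. five hours. "
                 "We paid approx. twenty euros. It was fun.")
abbreviations = learn_abbreviations(training_text)
sentences = split_sentences("It weighs approx. two kilos. Is that fine? Yes.",
                            abbreviations)
print(sentences)
```

The real tokenizer relies on statistical evidence and also models collocations and sentence starters, so this only shows the overall shape of the approach.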
Here is some demo code in Python that uses the pretrained English model:
---------------------------------------------------------------------------------
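import nltk.data
# Load the pretrained English Punkt model (NLTK's Punkt data must be installed).
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
# Read the text to split; "test.txt" stands for your own input file.
with open("test.txt") as fp:
    data = fp.read()
# Print the sentences, separated by a divider line.
print('\n-----\n'.join(tokenizer.tokenize(data)))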
---------------------------------------------------------------------------------
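Since the algorithm is unsupervised, you can also build a model from your own plain text instead of loading the pretrained English one. The following is a minimal sketch, assuming the nltk package is installed; the training and sample strings are made-up placeholders, and a real model needs far more text than this to learn useful abbreviations.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Placeholder training text (an assumption for illustration); in practice,
# pass a large sample of unannotated text from your own domain.
train_text = (
    "The meeting is at 10 a.m. on Monday. Dr. Smith will chair it. "
    "Please send questions to the organizers, e.g. by email. "
    "Dr. Jones will present at 11 a.m. after the break."
)

# Passing text to the constructor trains the Punkt model on that text.
tokenizer = PunktSentenceTokenizer(train_text)

sample = "Dr. Smith arrived late. The talk started at 10 a.m. and ran long."
sentences = tokenizer.tokenize(sample)
print('\n-----\n'.join(sentences))
```

With so little training text the splits may be imperfect; the point is only that no annotated data is needed.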
_____
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
Afzal, Naveed
Sent: Monday, October 29, 2007 5:48 PM
To: corpora at uib.no
Subject: [Corpora-List] Sentence Splitter tool
I am looking for a sentence splitter tool. Can anyone help me out
with this?
Thanks,
Naveed