[Corpora-List] Sentence Splitter tool
Eva Forsbom
evafo at stp.lingfil.uu.se
Mon Oct 29 11:40:20 UTC 2007
Hi Naveed,
> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf
> Of Afzal, Naveed Sent: 29 October 2007 09:48 To: corpora at uib.no Subject:
> [Corpora-List] Sentence Splitter tool
>
> I am looking for a sentence splitter tool ... can anyone help me out
> regarding this?
>
> Thanks,
> Naveed
This question (and questions on tokenisers) has been asked before on this
list. I collected snippets of some of the answers for my own benefit, and
enclose them below in the hope that they may be of some use.
Best regards,
Eva
--
-------------------------------------
Eva Forsbom, Uppsala University/GSLT
E-mail: evafo at stp.lingfil.uu.se
URL: http://stp.lingfil.uu.se/~evafo/
Telephone: +46 (0)18 471 70 06
Fax: +46 (0)18 471 14 16
Address:
Dept. of Linguistics and Philology
Box 635
SE-751 26 Uppsala
SWEDEN
Snippets on sentence splitters and tokenisers collected from corpora list
and elsewhere:
Dan Roth:
A pretty good sentence splitter can also be downloaded from http://L2R.cs.uiuc.edu/~cogcomp/cc-software.html
Miles Osborne:
if you mean code for segmenting text into sentences, then here are a few links:
Adwait Ratnaparkhi's MXTERMINATOR:
http://www.cis.upenn.edu/~adwait/statnlp.html
the LTG TTT system might be useful:
http://www.ltg.ed.ac.uk/software/ttt/index.html
Tony Rose:
There's a simple perl5 sentence splitter available at: http://search.cpan.org/author/TGROSE/HTML-Summary-0.017/
Don't know about good, but it's certainly free :)
Joerg Schuster:
I have also asked for sentencizers very recently. Here is a summary:
Name/Nickname (Author): Web Site (Comment)
* ave (Ave Wrigley): http://search.cpan.org/author/TGROSE/HTML-Summary-0.017/ (Perl module)
* mxterminator (Adwait Ratnaparkhi): http://www.cis.upenn.edu/~adwait/statnlp.html (Java, probabilistic)
* satz (David D. Palmer): http://elib.cs.berkeley.edu/src/satz/ (written in C, has to be trained)
* sentence.cgi (author unknown): http://misshoover.si.umich.edu/~zzheng/sentence/ (CGI script)
* shlomo (Shlomo Yona): http://search.cpan.org/author/SHLOMOY/Lingua-EN-Sentence-0.25/lib/Lingua/EN/Sentence.pm (Perl module)
* ttt (author unknown): http://www.ltg.ed.ac.uk/software/ttt/index.html (seems to be available only for SPARC machines)
You can test the programs ave, mxterminator and shlomo here: http://www.cis.uni-muenchen.de/~js/sentencize
If you do non-trivial tests, please let me know the results.
Staffan Hermansson:
Hello people. Here's a brief summary of the things I've received. Some people were nice enough to attach documents; I've located most of those on the web for you. Again, thank you for your support.
Applications:
* A free CPAN Perl module for sentence splitting: http://listserv.linguistlist.org/cgi-bin/wa?A2=ind0302&L=corpora&P=R5743
* Shlomo Yona maintains another Perl-based sentence splitter: http://cs.haifa.ac.il/~shlomo/
Earlier posts on this list (I might have missed some):
* http://helmer.aksis.uib.no/corpora/1998-4/0026.html
* http://helmer.aksis.uib.no/corpora/1999-3/0347.html
* http://helmer.aksis.uib.no/corpora/2000-2/0225.html
* http://helmer.aksis.uib.no/corpora/2003-1/0140.html
Reports:
Ghassan Mourad was nice and sent me the following attachments. Though I can't read a word of French (thanks anyway), they might still be of interest.
* Ghassan Mourad (1999) La segmentation de textes par l'étude de la ponctuation http://www.lalic.paris4.sorbonne.fr/articles/1998-1999/Mourad/CIDE99.pdf
* Ghassan Mourad La segmentation de textes par exploration contextuelle automatique, présentation du module SegATex
* Greg Grefenstette and Pasi Tapanainen. "What is a word, what is a sentence? Problems of tokenization." http://citeseer.nj.nec.com/grefenstette94what.html
* Tibor Kiss and Jan Strunk. "Scaled log likelihood ratios for the detection of abbreviations in text corpora." http://www.linguistics.rub.de/~kiss/publications/abbrev.pdf
* Tibor Kiss and Jan Strunk. "Multilingual Least-Effort Sentence Boundary Disambiguation." http://www.linguistics.rub.de/~kiss/publications/publications.html#boundaries
* Andrei Mikheev. "Text Segmentation." In R. Mitkov (ed.) Oxford Handbook of Computational Linguistics, OUP, 2003.
* Andrei Mikheev Tagging Sentence Boundaries (2000) http://citeseer.nj.nec.com/mikheev00tagging.html
* Andrei Mikheev Periods, Capitalized Words, etc (1999) http://citeseer.nj.nec.com/mikheev99periods.html
* David D. Palmer (2000). "Tokenisation and Sentence Segmentation." In Robert Dale, Hermann Moisl and Harold Somers (eds.), A Handbook of Natural Language Processing. Marcel Dekker.
* David D. Palmer and Marti A. Hearst. "Adaptive Multilingual Sentence Boundary Disambiguation." http://citeseer.nj.nec.com/palmer97adaptive.html
* J. Reynar and A. Ratnaparkhi. "A Maximum Entropy Approach to Identifying Sentence Boundaries." http://citeseer.nj.nec.com/article/reynar97maximum.html
From http://www.cs.rochester.edu/u/tetreaul/academic.html:
1. Sentence Splitters
* Satz Adaptive sentence boundary detector (C) (David Palmer and Marti Hearst)
* Dan Roth's splitter
* shlomoy Perl5 splitter
* tgrose: sentence perl module
* MXTERMINATOR (Adwait Ratnaparkhi)
* LTG TTT system
* Zhiping Zheng's cgi splitter
* Guenther cgi script
* Interactive Sentence Aligner (Joerg Tiedemann)
* Russian sentence splitter (C++; DLL available for download)
* English rule-based Java sentence splitter (Scott Piao)
check the corpora-list archives:
http://listserv.linguistlist.org/cgi-bin/wa?S1=corpora
Patrick Tschorn:
I am pleased to announce the immediate availability of Sentrick, a sentence
boundary detection program for German.
http://www.denkselbst.de/sbdniffler/sentrick.html
Sentrick requires Java 1.5, processes plain text, handles a variety of punctuation
characters (including quotes) and is licensed under the GNU GPL.
Scott Songlin Piao:
I put my English sentence splitter on the website:
http://text0.mib.man.ac.uk:8080/sentencebreaker/heuristic_tool
It is a rule-based Java program and is downloadable.
I put my sentence breaker at the site:
http://text0.mib.man.ac.uk:8080/scottpiao/sent_detector
It has performed with very high precision, including in a commercial
context. It is for English; I am not sure whether it works on Spanish.
You can try it on the demo website.
Jason Baldridge:
One fairly easy-to-use sentence boundary detector and tokenizer is
included in the OpenNLP toolkit:
http://opennlp.sf.net
It is written in Java and is basically the same as Ratnaparkhi's
detector. Lots of other tools, including parsing, tagging, and
coreference are in that package. There are already trained models
available for English. The tools themselves are not language specific, so
if you provide an appropriate training corpus in Spanish, you can train
new models easily enough. (And the code is open source, so you can modify
it to make it more sensitive to another language (e.g., morphology) if
you want.)
For other tools, many of which are geared for Spanish NLP, you might also
have a look at FreeLing:
http://garraf.epsevg.upc.es/freeling/
There are certainly many other tools available, and it is actually pretty
straightforward to whip up a detector from scratch. There are some recent
unsupervised approaches for sentence boundary detection too that could be
relevant for you. You might have a look at this article by Tibor Kiss and
Jan Strunk:
http://www.linguistics.ruhr-uni-bochum.de/~strunk/ks2005FINAL.pdf
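The "from scratch" route Baldridge mentions can be sketched in a few lines of Python. The regex and the abbreviation list below are purely illustrative assumptions, not taken from any of the tools in this thread:

```python
import re

# Toy abbreviation list; a real system would curate or learn this.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc", "e.g", "i.e"}

def split_sentences(text):
    """Split at ., ! or ? followed by whitespace and a capital letter
    (or opening quote/parenthesis), unless the period ends a known
    abbreviation."""
    sentences = []
    start = 0
    for m in re.finditer(r'[.!?]+(?=\s+[A-Z"(])', text):
        words = text[start:m.end()].rstrip('.!?').split()
        last = words[-1].lower().strip('.') if words else ''
        if m.group().startswith('.') and last in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not a boundary
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Dr. Smith arrived. He was late."))
# → ['Dr. Smith arrived.', 'He was late.']
```

A sketch like this breaks down quickly on unseen abbreviations, which is exactly the gap the unsupervised approaches mentioned above try to close.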
Steven Bird:
On 7/21/07, Jason Baldridge wrote:
> There are some recent unsupervised approaches for sentence boundary
> detection too that could be relevant for you. You might have a look at
> this article by Tibor Kiss and Jan Strunk:
> http://www.linguistics.ruhr-uni-bochum.de/~strunk/ks2005FINAL.pdf
Note that the Punkt system has been ported to Python and is included with the
Natural Language Toolkit (http://nltk.org/index.php), in module
nltk_contrib.punkt
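The unsupervised idea behind Punkt, from the Kiss & Strunk paper above, is that a word type which almost always carries a trailing period is probably an abbreviation, and that this can be detected with a log-likelihood ratio over corpus counts. Below is a toy, pure-Python illustration of that core scoring step only; it is not the actual Punkt algorithm, which adds several further heuristics, and the 0.99 rate for the abbreviation hypothesis is an assumed constant:

```python
import math
from collections import Counter

def binom_loglik(k, n, p):
    """Log-likelihood of k successes in n trials under success rate p."""
    p = min(max(p, 1e-9), 1 - 1e-9)  # clamp to avoid log(0)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def abbreviation_scores(tokens):
    """Score word types: high scores mean 'nearly always followed by a
    period', i.e. abbreviation-like. A simplified version of the
    Kiss & Strunk log-likelihood-ratio test."""
    with_period, total = Counter(), Counter()
    for tok in tokens:
        word = tok.rstrip('.').lower()
        if not word:
            continue
        total[word] += 1
        if tok.endswith('.'):
            with_period[word] += 1
    n = sum(total.values())
    p_null = sum(with_period.values()) / n   # corpus-wide period rate
    p_abbr = 0.99                            # abbreviation hypothesis
    return {
        w: 2 * (binom_loglik(with_period[w], c, p_abbr)
                - binom_loglik(with_period[w], c, p_null))
        for w, c in total.items()
    }

scores = abbreviation_scores(["the", "cat", "sat", "etc.",
                              "the", "cat", "ran", "etc."])
# "etc" (always with a period) scores far higher than "cat"
```

In a real system these scores would be computed over a large corpus and thresholded to build the abbreviation list that the boundary decisions then consult.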
Andy Roberts:
It hasn't undergone any major evaluation by me, but my jTokeniser
Java library has a sentence segmentation module. I'm utilising Java's
built-in text processing libraries (which were donated by IBM's ICU4J
project) to do all the hard work.
See http://www.andy-roberts.net/software/jTokeniser/
There's also a GUI available for you to test the various tokenisers
interactively.
Katrin Tomanek:
we have an ML-based sentence splitter and tokenizer. Both are a little
optimized for the biomedical domain (English), but are of course (given
that you have the training material) applicable to other domains.
Both tools are available in a command-line mode and as UIMA components.
They can be downloaded from our website: http://julielab.de. You will
find a reference to our paper on these tools (MEDINFO 2007) on the
website as well.
Kevin B. Cohen:
We had good luck with Andy's jTokeniser in a corpus refactoring
project recently. The inputs were biomedical texts, which present
some unique weirdness, and it performed well. I don't have
quantitative data. We *do* have some quantitative data on the
performance of the LingPipe sentence splitter, and it performs very
nicely in head-to-head comparisons with other systems.
Mehmet Kayaalp:
Last year, we examined 13 open-source, freeware software packages that can
perform NL tokenization (many of which also perform sentence boundary
detection and more) and summarized our experience in a technical report,
accessible at http://lhncbc.nlm.nih.gov/lhc/docs/reports/2006/tr2006003.pdf.
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora