[Corpora-List] summary: tokenization & sentence boundary detection

Tue Aug 10 21:21:47 UTC 2010

I just realized that this mail (from some time ago) went to the wrong 
e-mail address. Here it is again (see below).

By the way, are there freely available test sets for evaluating 
tokenization and sentence boundary detection? I would like to check 
performance for several languages and various domains.
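
One common way to score a sentence splitter against a gold segmentation is precision, recall, and F1 over boundary positions. The following is only a minimal sketch under stated assumptions: the function names are hypothetical, boundaries are counted as offsets in non-whitespace characters (so that differences in whitespace handling between tools do not matter), and the trivial boundary at the end of the text is ignored.

```python
def sentence_end_offsets(sentences):
    """Offsets (in non-whitespace characters) at which sentences end.

    The final offset is dropped because every segmentation of the same
    text trivially ends at the end of the text.
    """
    offsets, position = [], 0
    for sentence in sentences:
        position += sum(1 for ch in sentence if not ch.isspace())
        offsets.append(position)
    return offsets[:-1]


def boundary_scores(gold, predicted):
    """Precision, recall and F1 over two collections of boundary offsets."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative data only: the predicted output wrongly splits after "Dr."
gold_sentences = ["Dr. Smith arrived.", "He was late."]
predicted_sentences = ["Dr.", "Smith arrived.", "He was late."]

precision, recall, f1 = boundary_scores(
    sentence_end_offsets(gold_sentences),
    sentence_end_offsets(predicted_sentences),
)
```

In this toy case one of the two predicted boundaries is correct and the single gold boundary is found, so precision is 0.5 and recall is 1.0.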

Thanks again!
Jörg

-------- Original Message --------
Subject: (preliminary) summary: tokenization & sentence boundary detection
Date: Wed, 30 Jun 2010 15:16:54 +0200
From: Jörg Tiedemann <jorg.tiedemann at lingfil.uu.se>
To: corpora-owner at uib.no

Thanks a lot for all your replies to my query on
tokenization/segmentation tools! Here is a summary of the responses I've 
received so far (including the original list, in no particular order):

GATE (LGPL)
variety of tokenizers and splitters (generic & language specific)
http://gate.ac.uk/

MorphAdorner
http://morphadorner.northwestern.edu/
English only

"tokenize.pl" script from the WCDG parser:
http://nats-www.informatik.uni-hamburg.de/view/CDG/DownloadPage
(even de-hyphenation when used together with the parser's lexicon)

Segment (Java-based, MIT-type licence)
https://sourceforge.net/projects/segment/
SRX rules for sentence splitting; includes a library for
sentence splitting, which is used by LanguageTool and the Maligna
sentence aligner
C++ library in development (GPL)

MeCab (successor of ChaSen)
http://mecab.sourceforge.net/
Japanese

Juman
http://www-lab25.kuee.kyoto-u.ac.jp/nl-resource/juman-e.html
includes dependency parsing etc
Japanese

IceNLP (open source)
http://icenlp.sourceforge.net
tokenizer/sentence segmenter module, part of IceNLP, a toolkit for
Icelandic

Lingua::PT::PLNbase
Portuguese
heuristics with names and standard abbreviations
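
As an illustration of the general idea behind abbreviation heuristics (not the actual rules of Lingua::PT::PLNbase, which are not described here), a minimal sketch might look as follows. The abbreviation list, the function name, and the ASCII-only uppercase test are all assumptions made for the example.

```python
import re

# Illustrative abbreviation list; a real tool would ship a much larger,
# language-specific one.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc", "e.g", "i.e"}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split after ., ! or ? when followed by whitespace and an uppercase
    ASCII letter, unless the period ends a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?](?=\s+[A-Z])", text):
        end = match.end()
        # Last whitespace-delimited token before the candidate boundary,
        # with trailing periods removed and lowercased for lookup.
        token = text[start:end].split()[-1].rstrip(".").lower()
        if match.group() == "." and token in abbreviations:
            continue  # the period belongs to an abbreviation
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


example = "Dr. Smith met Prof. Jones in Uppsala. Was it useful? Yes."
result = split_sentences(example)
```

Here the periods after "Dr" and "Prof" are not treated as boundaries, so the example is split into three sentences. An abbreviation that genuinely ends a sentence would be missed by this sketch, which is one reason practical tools add further heuristics.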

http://www.cis.uni-muenchen.de/~wastl/misc/tokenizer.tgz
fast, rule-based, tokenizer + sentence boundary detector
German, Russian, English
Sebastian Nagel <wastl.nagel at googlemail.com>

Sentrick (GPLv3)
http://sourceforge.net/projects/sentrick/
sentence boundary detector for German, trainable

fullstop
http://hackage.haskell.org/package/fullstop
English sentence segmenter in Haskell

Grammatical Framework tool
http://hackage.haskell.org/package/toktok

MADA + TOKAN
http://www1.ccls.columbia.edu/~cadim/MADA.html
Arabic

Moses/Europarl tokenizer
http://www.statmt.org/wmt10/scripts.tgz

Europarl sentence splitter as Perl modules:
http://code.google.com/p/corpus-tools/downloads/list
http://search.cpan.org/~achimru/Lingua-Sentence-1.01/lib/Lingua/Sentence.pm

Other Perl modules:
http://search.cpan.org/~shlomoy/Lingua-EN-Sentence-0.25/lib/Lingua/EN/Sentence.pm
http://search.cpan.org/~holsten/Lingua-DE-Sentence-0.07/Sentence.pm
http://search.cpan.org/~shlomoy/Lingua-HE-Sentence-0.13/lib/Lingua/HE/Sentence.pm

Punkt
implemented in NLTK (Apache license)
http://www.nltk.org/
trainable (unsupervised)
existing models for different languages (?)

OpenNLP (LGPL)
http://opennlp.sourceforge.net/
trainable tokenizer & sentence boundary detector
models available for English, German, Spanish, Thai
further models to come, wiki at:
https://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Main_Page

huntoken (License?)
http://mokk.bme.hu/resources/huntoken
mainly for Hungarian (?)

Jena NLP tools
http://www.julielab.de/Resources/Software/NLP+Tools.html
trainable tokenizer & sentence splitter

FreeLing (GPL)
http://www.lsi.upc.edu/~nlp/freeling
regexp tokenizer
(mainly for Catalan & Spanish?)
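
To show what a regexp tokenizer does in general, here is a minimal sketch. It does not reproduce FreeLing's actual rules; the pattern and the function name are assumptions chosen for illustration, and a real tokenizer would need language-specific rules (clitics, abbreviations, dates, URLs and so on).

```python
import re

# Alternatives are tried left to right at each position.
TOKEN_PATTERN = re.compile(r"""
    \d+(?:[.,]\d+)*      # numbers, including 3.14 or 1,000
  | \w+(?:[-']\w+)*      # words, with internal hyphens or apostrophes
  | [^\w\s]              # any other single non-space character
""", re.VERBOSE)


def tokenize(text):
    """Return the list of tokens matched by TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("It costs 1,000.50 euros, doesn't it?")
```

With this pattern the number "1,000.50" and the contraction "doesn't" each stay in one piece, while the comma and the question mark become separate tokens.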

Alpino for Dutch (tokenization + sentence splitting)
http://www.let.rug.nl/vannoord/alp/Alpino/

Ellogon (LGPL)
http://www.ellogon.org

ChaSen for Japanese (successor: MeCab, see above)
http://chasen-legacy.sourceforge.jp/

MXPOST & MXTERMINATOR (research only!)
ftp://ftp.cis.upenn.edu/pub/adwait/jmx/
trainable sentence splitter

*******/\/\/\/\/\/\/\/\/\/\/\******************************************
  Jörg Tiedemann                      jorg.tiedemann at lingfil.uu.se
  Dep. of Linguistics and Philology   http://stp.lingfil.uu.se/~joerg/
  Uppsala University                  tel: +46 (0)18 - 471 1412
  Box 635, SE-751 26 Uppsala/SWEDEN   fax: +46 (0)18 - 471 1094
*********************************/\/\/\/\/\/\/\/\/\/\/\****************
