[Corpora-List] summary: tokenization & sentence boundary detection
Joerg Tiedemann
jorg.tiedemann at lingfil.uu.se
Tue Aug 10 21:21:47 UTC 2010
I just realized that this mail (from some time ago) went to the wrong
e-mail address. Here it is again (see below).
By the way, are there freely available test sets for evaluating
tokenization and sentence boundary detection? I would like to check
performance for several languages and various domains.
Thanks again!
Jörg
-------- Original Message --------
Subject: (preliminary) summary: tokenization & sentence boundary detection
Date: Wed, 30 Jun 2010 15:16:54 +0200
From: Jörg Tiedemann <jorg.tiedemann at lingfil.uu.se>
To: corpora-owner at uib.no
Thanks a lot for all your replies to my query on
tokenization/segmentation tools! Here is a summary of the responses I
have received so far (including the original list, in no particular order):
GATE (LGPL)
variety of tokenizers and splitters (generic & language specific)
http://gate.ac.uk/
MorphAdorner
http://morphadorner.northwestern.edu/
English only
"tokenize.pl" script from the WCDG parser:
http://nats-www.informatik.uni-hamburg.de/view/CDG/DownloadPage
(even de-hyphenation when used together with the parser's lexicon)
Java-based program, Segment
https://sourceforge.net/projects/segment/ (MIT-type licence)
SRX rules for sentence splitting, includes a library for
sentence splitting, which is used by LanguageTool and the Maligna
sentence aligner
C++ library in development (GPL)
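As a rough illustration of how SRX-style rules work (ordered break and no-break rules, each with a "before" and an "after" pattern, where the first rule that matches at a position decides), here is a minimal Python sketch. The rule list, the abbreviation set and the function name are assumptions made up for this example; they are not taken from Segment or from any real SRX file.

```python
import re

# Hypothetical rules in the spirit of SRX: (is_break, before_pattern, after_pattern).
# Order matters: the first rule matching at a position decides the outcome.
RULES = [
    # No break after a few common abbreviations (illustrative list only).
    (False, r"\b(?:Dr|Mr|Mrs|Prof|etc|e\.g|i\.e)\.", r"\s"),
    # Break after sentence-final punctuation followed by an upper-case start.
    (True, r"[.!?]+", r"\s+(?=[A-Z])"),
]

# Anchor each "before" pattern at the end of the text preceding the position.
COMPILED = [
    (is_break, re.compile(r"(?:" + before + r")\Z"), re.compile(after))
    for is_break, before, after in RULES
]


def split_sentences(text):
    """Split text wherever the first matching rule is a break rule."""
    sentences = []
    start = 0
    for pos in range(1, len(text)):
        for is_break, before, after in COMPILED:
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    sentences.append(text[start:pos].strip())
                    start = pos
                break  # first matching rule wins; later rules are ignored
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Dr. Smith arrived. He was late! Was he? Yes."))
```

The scan is quadratic in the text length because it re-slices the prefix at every position; that keeps the sketch short but would need rethinking for large inputs.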
MeCab (successor of ChaSen)
http://mecab.sourceforge.net/
Japanese
Juman
http://www-lab25.kuee.kyoto-u.ac.jp/nl-resource/juman-e.html
includes dependency parsing etc
Japanese
IceNLP (open source)
http://icenlp.sourceforge.net
tokenizer/sentence segmenter module, part of IceNLP, a toolkit for
Icelandic
Lingua::PT::PLNbase
Portuguese
heuristics with names and standard abbreviations
http://www.cis.uni-muenchen.de/~wastl/misc/tokenizer.tgz
fast, rule-based, tokenizer + sentence boundary detector
German, Russian, English
Sebastian Nagel <wastl.nagel at googlemail.com>
Sentrick (GPLv3)
http://sourceforge.net/projects/sentrick/
sentence boundary detector for German, trainable
fullstop
http://hackage.haskell.org/package/fullstop
English sentence segmenter in Haskell
Grammatical Framework tool
http://hackage.haskell.org/package/toktok
MADA + TOKAN
http://www1.ccls.columbia.edu/~cadim/MADA.html
Arabic
Moses/Europarl tokenizer
http://www.statmt.org/wmt10/scripts.tgz
Europarl sentence splitter as Perl modules:
http://code.google.com/p/corpus-tools/downloads/list
http://search.cpan.org/~achimru/Lingua-Sentence-1.01/lib/Lingua/Sentence.pm
Other Perl modules:
http://search.cpan.org/~shlomoy/Lingua-EN-Sentence-0.25/lib/Lingua/EN/Sentence.pm
http://search.cpan.org/~holsten/Lingua-DE-Sentence-0.07/Sentence.pm
http://search.cpan.org/~shlomoy/Lingua-HE-Sentence-0.13/lib/Lingua/HE/Sentence.pm
Punkt
implemented in NLTK (Apache license)
http://www.nltk.org/
trainable (unsupervised)
existing models for different languages (?)
OpenNLP (LGPL)
http://opennlp.sourceforge.net/
trainable tokenizer & sentence boundary detector
models available for English, German, Spanish, Thai
further models to come, wiki at:
https://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Main_Page
huntoken (License?)
http://mokk.bme.hu/resources/huntoken
mainly for Hungarian (?)
Jena NLP tools
http://www.julielab.de/Resources/Software/NLP+Tools.html
trainable tokenizer & sentence splitter
FreeLing (GPL)
http://www.lsi.upc.edu/~nlp/freeling
regexp tokenizer
(mainly for Catalan & Spanish?)
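For readers unfamiliar with the term, a regexp tokenizer can be sketched in a few lines of Python. The pattern below is a hypothetical one written for this illustration (it is not FreeLing's rule set): alternatives are ordered from most to least specific, and every match becomes one token.

```python
import re

# Hypothetical token pattern; alternatives are tried left to right.
TOKEN_RE = re.compile(
    r"""
    \d+(?:[.,]\d+)*        # numbers, optionally with . or , as separators
  | \w+(?:[-']\w+)*        # words, including hyphenated and apostrophe forms
  | [^\w\s]                # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the list of tokens matched by TOKEN_RE, in order."""
    return TOKEN_RE.findall(text)


print(tokenize("L'any 2010 costa 1,5 euros."))
```

Real tokenizers of this kind typically add language-specific alternatives (abbreviations, URLs, dates) ahead of the generic ones, which is where most of the per-language work lies.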
Alpino for Dutch (tokenization + sentence splitting)
http://www.let.rug.nl/vannoord/alp/Alpino/
Ellogon (LGPL)
http://www.ellogon.org
ChaSen for Japanese (successor: MeCab, see above)
http://chasen-legacy.sourceforge.jp/
MXPOST & MXTERMINATOR (research only!)
ftp://ftp.cis.upenn.edu/pub/adwait/jmx/
trainable sentence splitter
*******/\/\/\/\/\/\/\/\/\/\/\******************************************
Jörg Tiedemann jorg.tiedemann at lingfil.uu.se
Dep. of Linguistics and Philology http://stp.lingfil.uu.se/~joerg/
Uppsala University tel: +46 (0)18 - 471 1412
Box 635, SE-751 26 Uppsala/SWEDEN fax: +46 (0)18 - 471 1094
*********************************/\/\/\/\/\/\/\/\/\/\/\****************
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora