[Corpora-List] tokenizer & sentence boundary detection
Joerg Tiedemann
jorg.tiedemann at lingfil.uu.se
Mon Jun 14 12:54:56 UTC 2010
I'm looking for freely available tokenizers and sentence splitters for
various languages. I am interested in both language-specific and
language-independent (generic) tools, as well as domain-specific
tokenizers: anything off-the-shelf that can easily be used on
large-scale corpora.
Please reply directly to me. I will send a summary to the list later on.
Below is my (very incomplete) initial list.
Thanks,
Jörg
-------------------------------------------------------------------------
Moses/Europarl tokenizer
  http://www.statmt.org/wmt10/scripts.tgz

Europarl sentence splitter as Perl modules:
  http://code.google.com/p/corpus-tools/downloads/list
  http://search.cpan.org/~achimru/Lingua-Sentence-1.01/lib/Lingua/Sentence.pm

Other Perl modules:
  http://search.cpan.org/~shlomoy/Lingua-EN-Sentence-0.25/lib/Lingua/EN/Sentence.pm
  http://search.cpan.org/~holsten/Lingua-DE-Sentence-0.07/Sentence.pm
  http://search.cpan.org/~shlomoy/Lingua-HE-Sentence-0.13/lib/Lingua/HE/Sentence.pm

Punkt
  implemented in NLTK (Apache license)
  http://www.nltk.org/
  trainable (unsupervised)
  existing models for different languages (?)

OpenNLP (LGPL)
  http://opennlp.sourceforge.net/
  trainable tokenizer & sentence boundary detector
  models available for English, German, Spanish, Thai

huntoken (License?)
  http://mokk.bme.hu/resources/huntoken
  mainly for Hungarian (?)

Jena NLP tools
  http://www.julielab.de/Resources/Software/NLP+Tools.html
  trainable tokenizer & sentence splitter

FreeLing (GPL)
  http://www.lsi.upc.edu/~nlp/freeling
  regexp tokenizer
  (mainly for Catalan & Spanish?)

Alpino for Dutch (tokenization + sentence splitting)
  http://www.let.rug.nl/vannoord/alp/Alpino/

Ellogon (LGPL)
  http://www.ellogon.org

ChaSen for Japanese
  http://chasen-legacy.sourceforge.jp/

MXPOST & MXTERMINATOR (research only!)
  ftp://ftp.cis.upenn.edu/pub/adwait/jmx/
  trainable sentence splitter
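
To illustrate what the rule-based entries in the list do (a sentence splitter driven by an abbreviation list, and a regexp tokenizer), here is a minimal Python sketch. It is not taken from any of the packages above and does not reproduce their rules. The names NONBREAKING, split_sentences and tokenize are hypothetical, and the abbreviation list is a tiny English-only assumption; real tools ship much larger, language-specific lists.

```python
import re

# Assumption: a tiny, English-only list of abbreviations whose trailing
# period should not end a sentence. Purely illustrative.
NONBREAKING = {"dr", "mr", "mrs", "prof", "e.g", "i.e", "etc"}

# Candidate boundary: '.', '!' or '?' followed by whitespace and then an
# (optionally quoted or parenthesised) uppercase letter or digit.
_BOUNDARY = re.compile(r"([.!?])\s+(?=[\"'(]?[A-Z0-9])")

# Token: a word with optional internal '.', "'" or '-' (a.m, don't, off-the-shelf),
# or any single non-space, non-word character (punctuation).
_TOKEN = re.compile(r"\w+(?:[.'-]\w+)*|[^\w\s]")


def split_sentences(text, nonbreaking=NONBREAKING):
    """Split text at candidate boundaries, skipping known abbreviations."""
    sentences, start = [], 0
    for match in _BOUNDARY.finditer(text):
        words = text[start:match.start(1)].split()
        last = words[-1].lower() if words else ""
        if match.group(1) == "." and last in nonbreaking:
            continue  # the period belongs to an abbreviation
        sentences.append(text[start:match.end(1)].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def tokenize(sentence):
    """Return word and punctuation tokens of one sentence."""
    return _TOKEN.findall(sentence)


if __name__ == "__main__":
    sample = "Dr. Smith arrived at 10 a.m. on Monday. He met Prof. Jones! Was it useful?"
    for sent in split_sentences(sample):
        print(tokenize(sent))
```

A fixed list like this is exactly what the trainable tools in the list (Punkt, OpenNLP, MXTERMINATOR) try to avoid, since they learn the boundary decision from data instead.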