[Corpora-List] tokenizer & sentence boundary detection

Joerg Tiedemann jorg.tiedemann at lingfil.uu.se
Mon Jun 14 12:54:56 UTC 2010


I'm looking for freely available tokenizers and sentence splitters for
various languages. I am interested in both language-specific and
language-independent (generic) tools, as well as domain-specific
tokenizers: anything off-the-shelf that can easily be used on
large-scale corpora.

Please reply directly to me; I will send a summary to the list later on.
My (very incomplete) initial list is below.

Thanks,
Jörg


-------------------------------------------------------------------------

Moses/Europarl tokenizer
http://www.statmt.org/wmt10/scripts.tgz


Europarl sentence splitter as Perl modules:
http://code.google.com/p/corpus-tools/downloads/list
http://search.cpan.org/~achimru/Lingua-Sentence-1.01/lib/Lingua/Sentence.pm


Other Perl modules:
http://search.cpan.org/~shlomoy/Lingua-EN-Sentence-0.25/lib/Lingua/EN/Sentence.pm
http://search.cpan.org/~holsten/Lingua-DE-Sentence-0.07/Sentence.pm
http://search.cpan.org/~shlomoy/Lingua-HE-Sentence-0.13/lib/Lingua/HE/Sentence.pm
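For illustration only, here is a minimal Python sketch of the kind of rule-based sentence splitting that such modules perform: split after sentence-final punctuation unless the preceding word is a known abbreviation. It is not the algorithm of any module listed above; the abbreviation list, the boundary pattern, and the function name `split_sentences` are all assumptions made up for this example.

```python
import re

# Hypothetical abbreviation list (stored without the final period);
# real tools ship much larger, language-specific lists.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "e.g", "i.e", "etc"}

# Candidate boundary: '.', '!' or '?' followed by whitespace and then
# an uppercase ASCII letter or a digit.
CANDIDATE = re.compile(r"([.!?])\s+(?=[A-Z0-9])")


def split_sentences(text):
    """Split text at candidate boundaries, skipping known abbreviations."""
    sentences = []
    start = 0
    for match in CANDIDATE.finditer(text):
        # Last word before the punctuation mark, lowercased.
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        if match.group(1) == "." and last_word in ABBREVIATIONS:
            continue  # e.g. "Dr. Smith": not a sentence boundary
        sentences.append(text[start:match.end(1)].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences("Dr. Smith arrived late. He left early!"):
        print(sentence)
```

A sketch like this fails on sentences that start with a lowercase letter or end in an unlisted abbreviation, which is one reason per-language resources matter.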


Punkt
implemented in NLTK (Apache license)
http://www.nltk.org/
trainable (unsupervised)
existing models for different languages (?)


OpenNLP (LGPL)
http://opennlp.sourceforge.net/
trainable tokenizer & sentence boundary detector
models available for English, German, Spanish, Thai


huntoken (License?)
http://mokk.bme.hu/resources/huntoken
mainly for Hungarian (?)


Jena NLP tools
http://www.julielab.de/Resources/Software/NLP+Tools.html
trainable tokenizer & sentence splitter


FreeLing (GPL)
http://www.lsi.upc.edu/~nlp/freeling
regexp tokenizer
(mainly for Catalan & Spanish?)
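To make "regexp tokenizer" concrete, here is a minimal Python sketch of the general technique: an ordered set of regular-expression alternatives, tried left to right at each position. It does not reproduce FreeLing's actual rules or configuration format; the three patterns and the function name `tokenize` are assumptions chosen for this example.

```python
import re

# Alternatives are tried in order, so numbers are matched before
# generic words. These rules are illustrative, not FreeLing's.
TOKEN = re.compile(
    r"""
    \d+(?:[.,]\d+)*      # numbers such as 3.14 or 1,000
  | \w+(?:[-']\w+)*      # words, allowing internal hyphens/apostrophes
  | [^\w\s]              # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN.findall(text)


if __name__ == "__main__":
    print(tokenize("The state-of-the-art system costs 1,000.50 euros."))
```

Language- or domain-specific behaviour (clitics, URLs, abbreviations) would be added as further alternatives ahead of the generic ones.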


Alpino for Dutch (tokenization + sentence splitting)
http://www.let.rug.nl/vannoord/alp/Alpino/


Ellogon (LGPL)
http://www.ellogon.org


ChaSen for Japanese
http://chasen-legacy.sourceforge.jp/


MXPOST & MXTERMINATOR (research only!)
ftp://ftp.cis.upenn.edu/pub/adwait/jmx/
trainable sentence splitter





