[Corpora-List] tokenizer & sentence boundary detection
Alberto Simões
albie at alfarrabio.di.uminho.pt
Mon Jun 14 14:27:46 UTC 2010
Hello.
I'm not sure how far it can be considered *good*, but we use the Perl module
Lingua::PT::PLNbase for Portuguese. It has some good
heuristics, notably for names and standard abbreviations.
Also, we are always open to bug reports or enhancement requests.
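To give an idea of what an abbreviation heuristic of this kind looks like, here is a minimal Python sketch. It is not the code of Lingua::PT::PLNbase and does not reproduce its rules: the abbreviation list, the treatment of single capital letters as initials, and the function name split_sentences are all assumptions made for illustration.

```python
import re

# Hypothetical abbreviation list, for illustration only;
# it is NOT taken from Lingua::PT::PLNbase.
ABBREVIATIONS = {"dr", "dra", "sr", "sra", "prof", "av"}

# A candidate boundary is '.', '!' or '?' followed by whitespace.
_CANDIDATE = re.compile(r"([.!?])\s+")


def _blocks_split(word):
    """Return True if a period after `word` should not end the sentence."""
    word = word.lstrip("(\"'")
    if word.lower() in ABBREVIATIONS:
        return True
    # Assumption: a single capital letter is an initial in a name ("J. Silva").
    return len(word) == 1 and word.isupper()


def split_sentences(text):
    """Split text at sentence-final punctuation, skipping known abbreviations."""
    sentences = []
    start = 0
    for match in _CANDIDATE.finditer(text):
        # Only split when the next sentence starts with an uppercase letter.
        next_char = text[match.end():match.end() + 1]
        if not next_char.isupper():
            continue
        end = match.end(1)  # index just past the punctuation mark
        words = text[start:end - 1].split()
        last_word = words[-1] if words else ""
        if match.group(1) == "." and _blocks_split(last_word):
            continue
        sentences.append(text[start:end].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = "O Dr. Silva chegou. A Sra. Costa saiu às 10h! J. Tiedemann respondeu?"
    for sentence in split_sentences(sample):
        print(sentence)
```

A sketch like this will miss cases that a maintained module handles, for example an abbreviation that really does end a sentence, or a sentence that starts with a digit or a quotation mark.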
Cheers
Alberto
On 14/06/2010 13:54, Joerg Tiedemann wrote:
>
> I'm looking for freely available tokenizers and sentence splitters for
> various languages. I am interested in language-specific and
> language-independent/generic tools. I am also interested in
> domain-specific tokenizers - anything (off-the-shelf) that can easily be
> used on large scale corpora.
>
> Reply directly to me. I will send a summary to the list later on.
> Below you can see my (very incomplete) initial list.
>
> Thanks,
> Jörg
>
>
> -------------------------------------------------------------------------
>
> Moses/Europarl tokenizer
> http://www.statmt.org/wmt10/scripts.tgz
>
>
> Europarl sentence splitter as Perl modules:
> http://code.google.com/p/corpus-tools/downloads/list
> http://search.cpan.org/~achimru/Lingua-Sentence-1.01/lib/Lingua/Sentence.pm
>
>
> Other Perl modules:
> http://search.cpan.org/~shlomoy/Lingua-EN-Sentence-0.25/lib/Lingua/EN/Sentence.pm
> http://search.cpan.org/~holsten/Lingua-DE-Sentence-0.07/Sentence.pm
> http://search.cpan.org/~shlomoy/Lingua-HE-Sentence-0.13/lib/Lingua/HE/Sentence.pm
>
>
>
> Punkt
> implemented in NLTK (Apache license)
> http://www.nltk.org/
> trainable (unsupervised)
> existing models for different languages (?)
>
>
> OpenNLP (GPL)
> http://opennlp.sourceforge.net/
> trainable tokenizer & sentence boundary detector
> models available for English, German, Spanish, Thai
>
>
> huntoken (License?)
> http://mokk.bme.hu/resources/huntoken
> mainly for Hungarian (?)
>
>
> Jena NLP tools
> http://www.julielab.de/Resources/Software/NLP+Tools.html
> trainable tokenizer & sentence splitter
>
>
> FreeLing (GPL)
> http://www.lsi.upc.edu/~nlp/freeling
> regexp tokenizer
> (mainly for Catalan & Spanish?)
>
>
> Alpino for Dutch (tokenization + sentence splitting)
> http://www.let.rug.nl/vannoord/alp/Alpino/
>
>
> Ellogon (LGPL)
> http://www.ellogon.org
>
>
> ChaSen for Japanese
> http://chasen-legacy.sourceforge.jp/
>
>
> MXPOST & MXTERMINATOR (research only!)
> ftp://ftp.cis.upenn.edu/pub/adwait/jmx/
> trainable sentence splitter
>
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
--
Alberto Simões