[Corpora-List] tokenizer & sentence boundary detection

Jimmy O'Regan joregan at gmail.com
Mon Jun 14 13:35:34 UTC 2010


On 14 June 2010 13:54, Joerg Tiedemann <jorg.tiedemann at lingfil.uu.se> wrote:
>
> I'm looking for freely available tokenizers and sentence splitters for
> various languages. I am interested in language-specific and
> language-independent/generic tools. I am also interested in domain-specific
> tokenizers - anything (off-the-shelf) that can easily be used on large scale
> corpora.

There's Segment, a Java-based program
(https://sourceforge.net/projects/segment/, MIT-type licence), which
uses SRX rules for sentence splitting. It includes a library for
sentence splitting, which is used by LanguageTool and the Maligna
sentence aligner.
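
To give a rough idea of how SRX-style rules work, here is a minimal
sketch in Python. It is not Segment's API, and the two rules are
made-up examples rather than anything shipped with Segment. It only
assumes the basic SRX model: each rule is marked break or no-break and
has a "before break" and an "after break" regular expression, and at
each position the first rule that matches decides.

```python
import re
from typing import List, NamedTuple


class Rule(NamedTuple):
    is_break: bool   # True: break here; False: exception, do not break
    before: str      # regex that must match text ending at the position
    after: str       # regex that must match text starting at the position


# Hypothetical rules for illustration only. Order matters: the
# no-break exception has to come before the general break rule.
RULES = [
    Rule(False, r"\b(?:Mr|Mrs|Dr|Prof|etc)\.", r"\s"),
    Rule(True, r"[.?!]+", r"\s+"),
]


def split_sentences(text: str, rules: List[Rule] = RULES) -> List[str]:
    # Anchor each "before" pattern at the end of the text seen so far.
    compiled = [
        (r.is_break, re.compile(r"(?:" + r.before + r")\Z"), re.compile(r.after))
        for r in rules
    ]
    segments = []
    start = 0
    for pos in range(1, len(text)):
        # Slicing per position is quadratic; fine for a sketch,
        # too slow for large corpora.
        prefix = text[:pos]
        for is_break, before, after in compiled:
            if before.search(prefix) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # first matching rule wins, break or not
    segments.append(text[start:])
    # Whitespace stays at the start of the next segment; strip it
    # here for readability.
    return [s.strip() for s in segments if s.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("Dr. Smith arrived. He left."):
        print(sentence)
```

The point of the ordering is that language-specific exceptions
(abbreviations and so on) can be listed ahead of a generic break rule,
which is what makes the same engine usable across languages.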


-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.



