[Corpora-List] tokenizer & sentence boundary detection

Adam Radziszewski kocikikut at gmail.com
Mon Jun 14 14:49:22 UTC 2010


>
> There's the Java-based program, Segment
> (https://sourceforge.net/projects/segment/ MIT-type licence) which
> uses SRX rules for sentence splitting. It includes a library for
> sentence splitting, which is used by LanguageTool and the Maligna
> sentence aligner.
We are currently finishing work on a C++ library for tokenisation and
sentence splitting. The tokeniser is configured in an INI file, which
specifies the successive processing layers (the input is first
tokenised on white space). Our sentence splitter is another
implementation of SRX rules and, as far as we know, the first one in C++
(using ICU regexen). The splitter is not faster than the Java Segment
library mentioned above (Segment uses lots of fancy tricks with
lookbehinds which are not available in ICU); nevertheless, it may be
more convenient if you don't want to use Java (employing JNI in
such a low-level processing step seems like overkill). If you're
interested, we'll send the link as soon as it is released (it should
be GPL'd within days, if I know the project policy well).
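
For readers unfamiliar with the SRX rule model, here is a minimal sketch
of the idea, not code from our library. It uses Python's re module
rather than ICU or Java regexen, and the Rule type, the two example
rules and the split_sentences function are hypothetical names made up
for illustration. Each rule has a "before" and an "after" pattern; at
every position in the text the first rule whose patterns both match
decides whether that position is a sentence break.

```python
import re
from typing import List, NamedTuple


class Rule(NamedTuple):
    is_break: bool  # True: break here; False: exception, do not break
    before: str     # must match the text ending at the position
    after: str      # must match the text starting at the position


# Hypothetical rules; order matters, so exceptions come first.
RULES = [
    Rule(False, r"\b(?:Mr|Mrs|Dr|Prof|etc)\.", r"\s"),
    Rule(True, r"[.?!]+", r"\s+"),
]


def split_sentences(text: str, rules: List[Rule] = RULES) -> List[str]:
    # Anchor each "before" pattern at the end of the prefix with \Z.
    compiled = [
        (r.is_break, re.compile("(?:" + r.before + r")\Z"), re.compile(r.after))
        for r in rules
    ]
    sentences = []
    start = 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            # Slicing is quadratic but keeps the sketch simple.
            if before.search(text[:pos]) and after.match(text[pos:]):
                if is_break:
                    segment = text[start:pos].strip()
                    if segment:
                        sentences.append(segment)
                    start = pos
                break  # the first matching rule wins
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    print(split_sentences("Dr. Smith arrived. He was late."))
```

A real SRX implementation would read the rules from an XML file and
select them by language; the surrounding white space is stripped here
only to keep the output tidy.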

BTW, the SRX standard is somewhat crippled by the very fact that its
rule syntax and semantics are defined in terms of a particular
regex implementation, namely that of the Java standard library. This is
inconsistent with the "vendor-neutrality" proclaimed in the specs.

Adam Radziszewski

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


