[Corpora-List] tokenizer & sentence boundary detection

Jimmy O'Regan joregan at gmail.com
Mon Jun 14 16:42:34 UTC 2010


On Monday, June 14, 2010, Adam Radziszewski <kocikikut at gmail.com> wrote:
>>
>> There's the Java-based program, Segment
>> (https://sourceforge.net/projects/segment/ MIT-type licence) which
>> uses SRX rules for sentence splitting. It includes a library for
>> sentence splitting, which is used by LanguageTool and the Maligna
>> sentence aligner.
> We are currently finishing work on a C++ library for tokenisation and
> sentence splitting. The tokeniser is configured in an INI file, which
> specifies the successive processing layers (the input is first
> tokenised on whitespace). Our sentence splitter is another
> implementation of SRX rules and, as far as we know, the first one in
> C++ (using ICU regexen). The splitter is not faster than the Java
> Segment library mentioned above (Segment uses lots of fancy tricks
> with lookbehinds which are not available in ICU); nevertheless, it may
> be more convenient if you don't want to use Java (employing JNI in
> such a low-level processing step seems like overkill). If you're
> interested, we'll send the link as soon as it is released (it should
> be GPL'd within days, if I know the project policy well).
>
> BTW the SRX standard is somewhat crippled by the very fact that the
> rule syntax and semantics definitions are based on a particular
> implementation of regexen, namely that of the Java standard library.
> This is inconsistent with the "vendor-neutrality" proclaimed in the
> specs.
>

Odd; my recollection was that the regex variant specified in the SRX
standard was the ICU one.

I might have a user for SRX in C++ - I'd appreciate a link when it
becomes available.
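
For anyone following the thread who hasn't looked at SRX: the rule
model is roughly "at each position, the first rule whose before-break
and after-break patterns both match decides whether to split there".
Below is a minimal sketch of that idea in Python. It is not how Segment
or the C++ library mentioned above is implemented; the rule list, the
abbreviations and the function name are made up for illustration, and
it uses Python's re module rather than ICU or Java regexes.

```python
import re

# Hypothetical rules in the spirit of SRX: (is_break, before_break, after_break).
# Order matters: the first rule that matches at a position wins.
RULES = [
    # Exception rule: no break after these abbreviations.
    (False, r"\b(?:Mr|Mrs|Dr|etc)\.", r"\s"),
    # Break rule: terminal punctuation followed by whitespace.
    (True, r"[.?!]+", r"\s+"),
]


def split_sentences(text, rules=RULES):
    """Split text using ordered break / no-break rules (illustrative only)."""
    compiled = [
        # Anchor the before-break pattern at the end of the text seen so far.
        (is_break, re.compile(r"(?:" + before + r")\Z"), re.compile(after))
        for is_break, before, after in rules
    ]
    breaks = []
    for pos in range(1, len(text)):
        left = text[:pos]  # simple but quadratic; fine for a sketch
        for is_break, before, after in compiled:
            if before.search(left) and after.match(text, pos):
                if is_break:
                    breaks.append(pos)
                break  # the first matching rule decides, break or not
    segments, start = [], 0
    for pos in breaks + [len(text)]:
        segment = text[start:pos].strip()
        if segment:
            segments.append(segment)
        start = pos
    return segments


if __name__ == "__main__":
    for sentence in split_sentences("Dr. Smith arrived. He was late! Really?"):
        print(sentence)
```

Run on the sample string, this keeps "Dr. Smith arrived." as one
sentence because the exception rule is listed before the break rule. A
real SRX implementation also handles language mapping and cascading,
which this sketch leaves out.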

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
