[Corpora-List] sentence boundary detectors

Victor Kapustin victor.kapustin at gmail.com
Sun Feb 18 13:59:56 UTC 2007


Armin,

> I was wondering if you could point me to good sentence splitters for the
> following languages: German, Russian
For Russian:

http://aot.ru/download/graphan.tar.gz (C++ source; a DLL is included in
http://aot.ru/download/shortrml.zip).

For most purposes I use a regexp (in JavaScript; conversion to Perl/Python is
straightforward):

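// Characters that may open a sentence (bracket, quote, tag, slash).
var _DELIMS_OPEN_RAW_ = '(["</' ;
// Backslash-escape each of them for use inside a character class.
var _DELIMS_OPEN_ = '\\'+_DELIMS_OPEN_RAW_.split('').join('\\') ;
// Boundary: a run of . ! ? plus whitespace, if an uppercase letter follows.
var sentenceSplitter = new RegExp(
'(?:\\.|\\!|\\?)+\\s+(?=['+_DELIMS_OPEN_+']?[А-ЯЁA-Z])' ) ;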
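
A minimal usage sketch for the regexp above. The helper name splitSentences
and the sample text are illustrative assumptions, not part of the original
code. A plain String.prototype.split with this pattern would drop the matched
punctuation of every sentence except the last, so the sketch walks the
matches with exec() and keeps the punctuation attached:

```javascript
// Same pattern as above, rebuilt so that the example is self-contained.
var DELIMS_OPEN_RAW = '(["</';
var DELIMS_OPEN = '\\' + DELIMS_OPEN_RAW.split('').join('\\');
var PATTERN = '(?:\\.|\\!|\\?)+\\s+(?=[' + DELIMS_OPEN + ']?[А-ЯЁA-Z])';

// Hypothetical helper (an assumption for illustration): returns the
// sentences of `text` with their terminating punctuation preserved.
function splitSentences(text) {
  var re = new RegExp(PATTERN, 'g'); // 'g' lets exec() visit every boundary
  var sentences = [];
  var start = 0;
  var m;
  while ((m = re.exec(text)) !== null) {
    // m[0] is the punctuation run plus the whitespace after it;
    // trim() removes that whitespace from the end of the sentence.
    var end = m.index + m[0].length;
    sentences.push(text.slice(start, end).trim());
    start = end;
  }
  // Whatever follows the last boundary is the final sentence.
  var rest = text.slice(start).trim();
  if (rest) sentences.push(rest);
  return sentences;
}

var sample = 'Он пришёл в 5 ч. вечера. "Где все?" спросил он. Nobody answered!';
console.log(splitSentences(sample));
```

On this sample the abbreviation "ч." does not trigger a split, because the
next word starts with a lowercase letter, and the "?" inside the quotation
does not either, because it is not followed by whitespace. An abbreviation
followed by a capitalized word (e.g. a name) would still be split.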

--
Victor Kapustin


