[Corpora-List] Sentence segmenting

Adam Radziszewski kocikikut at gmail.com
Tue Aug 14 07:51:06 UTC 2012


On 14 August 2012 00:29, Marcin Miłkowski <list-address at wp.pl> wrote:

> Hi Jeff,
>
> if you want to reuse translators' resources (and computer-aided
> translation tools need to have text segmented into sentences), you can use
> the SRX standard. I have authored some rules for English, though they are
> not perfect (I have a much better set of rules for Polish). The open-source
> library that supports SRX, segment, is also pretty fast.
>

If you are interested in using SRX rules, you may also want to try our C++
implementation <http://nlp.pwr.wroc.pl/redmine/projects/toki/wiki/> (GNU
LGPL). Its processing speed, measured in tokens per second, is similar to
that of Marcin Miłkowski's Java tool, segment. If many short texts are to be
processed, though, it may be convenient to avoid the Java VM start-up time.
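
For readers who have not worked with SRX, here is a rough sketch of the idea
behind such rules: an ordered list of break and no-break rules, each with a
regex for the text before a candidate position and one for the text after it,
where the first matching rule decides. This is an illustration in Python
only, not code from either library; the two rules and the name segment_text
are made up for the example and are not taken from any published SRX file.

```python
import re

# Hypothetical rules in SRX spirit: (is_break, before_regex, after_regex).
# They are tried in order; the first rule matching at a position wins.
RULES = [
    (False, r"\b(?:Mr|Mrs|Dr|etc)\.", r"\s"),  # no break after these abbreviations
    (True, r"[.?!]+", r"\s+"),                 # break after sentence-final punctuation
]


def segment_text(text, rules=RULES):
    """Split text into sentences using ordered break / no-break rules."""
    compiled = [
        (is_break, re.compile(before + r"\Z"), re.compile(after))
        for is_break, before, after in rules
    ]
    segments = []
    start = 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            # "before" must match text ending at pos, "after" text starting at pos.
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # first matching rule decides, later rules are ignored
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


if __name__ == "__main__":
    for sentence in segment_text("Dr. Smith arrived. He left."):
        print(sentence)
```

Slicing the text at every position keeps the sketch short but makes it
quadratic in the input length; a real implementation would avoid that.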

Best,
Adam Radziszewski