Hi Jeff,<br><br>You might like to try our standalone English sentence segmenter, which can be downloaded at <a href="http://www.fleric.org.cn/pub/ss.rar">http://www.fleric.org.cn/pub/ss.rar</a><br><br>Jiajin<br><br>Jiajin Xu<br>

PhD, associate professor<br>National Research Centre for Foreign Language Education<br>Beijing Foreign Studies University<br><br><div class="gmail_quote">On Tue, Aug 14, 2012 at 6:29 AM, Marcin Miłkowski <span dir="ltr">&lt;<a href="mailto:list-address@wp.pl" target="_blank">list-address@wp.pl</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Jeff,<br>

<br>

if you want to reuse translators' resources (computer-aided translation tools also need text segmented into sentences), you can use the SRX standard. I have authored some rules for English, though they are not perfect (I have a much better set of rules for Polish). The open-source library that supports SRX, segment, is also pretty fast.<br>
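<br>
As a rough illustration of how SRX-style segmentation works (not the actual rules in segment.srx, and not the API of the segment library): an SRX file lists break and no-break rules, each with a "before break" and an "after break" regular expression, and at each text position the first rule that matches decides. The Python sketch below mimics that behaviour; RULES and split_sentences are hypothetical names, and the three patterns are assumptions chosen only for the example.<br>
<br>
```python
import re

# Hypothetical SRX-like rules: (is_break, before_pattern, after_pattern).
# Rules are tried in order at every position and the first match wins,
# so the no-break exceptions are listed before the general break rule.
RULES = [
    (False, r"\b(?:Mr|Mrs|Ms|Dr|Prof)\.", r"\s"),  # honorifics: do not break
    (False, r"\b(?:e\.g|i\.e)\.", r"\s"),          # common abbreviations: do not break
    (True, r"[.?!]+", r"\s+"),                     # general rule: break after . ? !
]

# Anchor each "before" pattern at the end of the text preceding the position.
COMPILED = [
    (is_break, re.compile("(?:" + before + r")\Z"), re.compile(after))
    for is_break, before, after in RULES
]


def split_sentences(text):
    """Split text into sentences using the ordered rules above."""
    segments = []
    start = 0
    for pos in range(1, len(text)):
        for is_break, before, after in COMPILED:
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # the first matching rule decides this position
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


sample = "Dr. Smith met Mrs. Jones. They talked, e.g. about SRX. Done!"
for sentence in split_sentences(sample):
    print(sentence)
```
<br>
Listing the exceptions before the general rule is what keeps "Dr." and "e.g." from ending a sentence here; a real rule set needs far more exceptions than this.<br>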


<br>

The paper is here:<br>

<br>

<a href="http://marcinmilkowski.pl/downloads/ltc-043-milkowski.pdf" target="_blank">http://marcinmilkowski.pl/downloads/ltc-043-milkowski.pdf</a><br>

<br>

The rules are here:<br>

<br>

<a href="http://languagetool.svn.sourceforge.net/viewvc/languagetool/trunk/JLanguageTool/src/resource/segment.srx?revision=7751" target="_blank">http://languagetool.svn.sourceforge.net/viewvc/languagetool/trunk/JLanguageTool/src/resource/segment.srx?revision=7751</a><br>


<br>

Regards,<br>

Marcin<br>

<br>

On 2012-08-13 22:20, Sebastian Nagel wrote:<div class="HOEnZb"><div class="h5"><br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi Jeff,<br>

<br>

two years ago there was an exhaustive summary of a similar request:<br>

<a href="http://mailman.uib.no/public/corpora/2010-August/011367.html" target="_blank">http://mailman.uib.no/public/corpora/2010-August/011367.html</a><br>

<br>

But check the list archives (or Google) for<br>

"sentence (splitt(er|ing)|boundar(y|ies)|detector)" or similar.<br>

There have been a couple of threads over the last few years.<br>
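<br>
A quick way to see which thread subjects that pattern would catch is to run it in Python; this is only a sketch, and the case-insensitive flag, the helper name matches_query and the sample subjects are assumptions added for the example.<br>
<br>
```python
import re

# The search pattern quoted above, compiled case-insensitively (an assumption).
QUERY = re.compile(
    r"sentence (splitt(er|ing)|boundar(y|ies)|detector)", re.IGNORECASE
)


def matches_query(subject):
    """Return True if the subject line contains one of the phrase variants."""
    return QUERY.search(subject) is not None


# Made-up subject lines, used only to exercise the pattern.
subjects = [
    "Sentence splitter for English",
    "sentence boundary detection",
    "sentence segmenting for English",
]
for subject in subjects:
    print(subject, "matched" if matches_query(subject) else "not matched")
```
<br>
Wordings such as "sentence segmenting" fall outside the pattern, which is why the "or similar" matters.<br>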

<br>

Regards,<br>

Sebastian<br>

<br>

On 08/13/2012 03:35 PM, Jeff Elmore wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

I'm curious what folks are using these days for sentence segmenting for<br>

English.<br>

<br>

My application involves narrative and informational texts at a variety of<br>

reading levels and genres. Most text is hand-edited to eliminate non-prose<br>

content but any system that could respond robustly to unedited text would<br>

be awesome, of course.<br>

<br>

Mostly we've been using hand-crafted tools written in Python. I have<br>

checked out what NLTK offers, but from what I've seen nothing in it is<br>

terribly accurate (it fails on obvious common cases such as some<br>

honorifics). We did develop a decision-tree-based model using Weka for<br>

Spanish text. I'd be happy to do this again for English but wanted to see<br>

if there's something good already out there.<br>

<br>

Thanks in advance!<br>

<br>

<br>

<br>

_______________________________________________<br>

UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a><br>

Corpora mailing list<br>

<a href="mailto:Corpora@uib.no" target="_blank">Corpora@uib.no</a><br>

<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>

<br>

</blockquote>

<br>

<br>


<br>

<br>

</blockquote>

<br>

<br>


</div></div></blockquote></div><br>