Hi Jeff,<br><br>You might like to try our standalone English sentence segmenter, which can be downloaded at <a href="http://www.fleric.org.cn/pub/ss.rar">http://www.fleric.org.cn/pub/ss.rar</a><br><br>Jiajin<br><br>Jiajin Xu<br>
PhD, associate professor<br>National Research Centre for Foreign Language Education<br>Beijing Foreign Studies University<br><br><div class="gmail_quote">On Tue, Aug 14, 2012 at 6:29 AM, Marcin Miłkowski <span dir="ltr">&lt;<a href="mailto:list-address@wp.pl" target="_blank">list-address@wp.pl</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Jeff,<br>
<br>
If you want to reuse translators' resources (computer-aided translation tools need text segmented into sentences), you can use the SRX standard. I have authored some rules for English, though they are not perfect (I have a much better set of rules for Polish). The open-source library that supports SRX, segment, is also pretty fast.<br>
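<br>
As a rough illustration of how SRX-style rules operate: each rule pairs a "before break" pattern with an "after break" pattern and is marked as break or no-break, and the first rule that matches at a position decides the outcome. The Python sketch below only mimics that idea. It is not the segment library and does not use the LanguageTool rules; the rule list and the function name split_sentences are assumptions made for illustration.<br>

```python
import re

# (is_break, before_pattern, after_pattern). Earlier rules take priority,
# mirroring the ordered-rule idea in SRX. These rules are illustrative
# assumptions, not copied from segment.srx.
RULES = [
    (False, r"\b(?:Mr|Mrs|Ms|Dr|Prof)\.", r"\s"),  # honorifics: never break
    (False, r"\b(?:e\.g|i\.e)\.", r"\s"),          # common abbreviations
    (True, r"[.?!]+", r"\s+"),                     # default: break after . ? !
]

# Anchor each "before" pattern to the end of the text preceding the position.
_COMPILED = [
    (is_break, re.compile("(?:" + before + r")\Z"), re.compile(after))
    for is_break, before, after in RULES
]


def split_sentences(text):
    """Split text using the first matching rule at each character position."""
    breaks = []
    for i in range(1, len(text)):
        for is_break, before, after in _COMPILED:
            if before.search(text[:i]) and after.match(text, i):
                if is_break:
                    breaks.append(i)
                break  # first matching rule wins, break or no-break
    pieces, start = [], 0
    for pos in breaks + [len(text)]:
        piece = text[start:pos].strip()
        if piece:
            pieces.append(piece)
        start = pos
    return pieces


sentences = split_sentences(
    "Dr. Smith arrived late. He met Mrs. Jones, i.e. the editor. Really?"
)
```

With these toy rules the example text comes out as three sentences, because the no-break rules for "Dr.", "Mrs." and "i.e." are checked before the general break rule. A real SRX rule set is much larger and language-specific.<br>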
<br>
The paper is here:<br>
<br>
<a href="http://marcinmilkowski.pl/downloads/ltc-043-milkowski.pdf" target="_blank">http://marcinmilkowski.pl/downloads/ltc-043-milkowski.pdf</a><br>
<br>
The rules are here:<br>
<br>
<a href="http://languagetool.svn.sourceforge.net/viewvc/languagetool/trunk/JLanguageTool/src/resource/segment.srx?revision=7751" target="_blank">http://languagetool.svn.sourceforge.net/viewvc/languagetool/trunk/JLanguageTool/src/resource/segment.srx?revision=7751</a><br>
<br>
Regards,<br>
Marcin<br>
<br>
On 2012-08-13 22:20, Sebastian Nagel wrote:<div class="HOEnZb"><div class="h5"><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hi Jeff,<br>
<br>
Two years ago there was an exhaustive summary of a similar request:<br>
<a href="http://mailman.uib.no/public/corpora/2010-August/011367.html" target="_blank">http://mailman.uib.no/public/corpora/2010-August/011367.html</a><br>
<br>
Also check the list archives (or Google) for<br>
"sentence (splitt(er|ing)|boundar(y|ies)|detector)" or similar.<br>
There have been a couple of threads over the last few years.<br>
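<br>
The search pattern above is already a valid regular expression, so it can be applied directly to a locally saved list of subject lines. A minimal sketch in Python, assuming case-insensitive matching is wanted; the subject lines are invented for illustration and are not real archive entries:<br>

```python
import re

# The pattern from the message above, compiled case-insensitively.
PATTERN = re.compile(
    r"sentence (splitt(er|ing)|boundar(y|ies)|detector)",
    re.IGNORECASE,
)

# Hypothetical subject lines, made up for this example.
subjects = [
    "Sentence splitter for German?",
    "Sentence boundary detection in tweets",
    "POS tagger for Old English",
    "Comparing sentence detectors",
]

# Keep only the subjects in which the pattern occurs somewhere.
matches = [s for s in subjects if PATTERN.search(s)]
```

Here matches keeps the first, second and fourth subject lines and drops the one about POS tagging.<br>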
<br>
Regards,<br>
Sebastian<br>
<br>
On 08/13/2012 03:35 PM, Jeff Elmore wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I'm curious what folks are using these days for sentence segmenting for<br>
English.<br>
<br>
My application involves narrative and informational texts at a variety of<br>
reading levels and genres. Most text is hand-edited to eliminate non-prose<br>
content, but any system that could respond robustly to unedited text would<br>
be awesome, of course.<br>
<br>
Mostly we've been using hand-crafted tools written in Python. I have<br>
checked out what NLTK offers, but from what I've seen there's nothing<br>
terribly accurate in it (it fails on obvious common cases like some<br>
honorifics). We did develop a decision-tree-based model using Weka for<br>
Spanish text. I'd be happy to do this again for English but wanted to see<br>
if there's something good already out there.<br>
<br>
Thanks in advance!<br>
<br>
<br>
<br>
_______________________________________________<br>
UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a><br>
Corpora mailing list<br>
<a href="mailto:Corpora@uib.no" target="_blank">Corpora@uib.no</a><br>
<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>
<br>
</blockquote>
<br>
<br>
<br>
<br>
</blockquote>
<br>
<br>
</div></div></blockquote></div><br>