<div dir="ltr">As far as I know, most approaches designed specifically for Chinese sentence segmentation focus on segmenting unpunctuated text, e.g.<div style>Huang/Cheng (2011): Pause and Stop Labeling for Chinese Sentence Boundary Detection [1]</div>
<div style>might be what you're looking for, although the F-scores seem a bit low for production use.</div><div style>If you have a large enough corpus, you can always try unsupervised, multilingual algorithms such as</div>
<div style>Kiss/Strunk (2006): Unsupervised Multilingual Sentence Boundary Detection [2]</div><div style>An implementation of it is bundled with NLTK. It has good accuracy for English data, but I'm not sure if it would work well with your material.</div>
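<div style>For reference, here is a minimal sketch of training the NLTK implementation (Punkt) on your own raw text. The tiny English corpus and the helper name train_punkt are placeholders I made up for illustration; a real run would need a much larger corpus from your own domain. Note also that, as far as I know, Punkt by default only treats ".", "?" and "!" as sentence-ending characters and expects whitespace between tokens, so Chinese text would probably need extra adaptation that is not shown here.</div>

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer


def train_punkt(corpus_text):
    """Learn Punkt parameters (abbreviations, collocations, sentence
    starters) from raw, unannotated text and return a tokenizer.

    Hypothetical helper for illustration; not part of NLTK itself.
    """
    trainer = PunktTrainer()
    trainer.train(corpus_text, finalize=True)
    return PunktSentenceTokenizer(trainer.get_params())


# Placeholder corpus (an assumption for illustration): real training
# needs far more text, ideally from the same domain as the documents.
corpus = (
    "Dr. Lee reviewed the annual report. The report was long. "
    "Dr. Lee asked for a summary. The summary arrived on Monday. "
    "Dr. Chen read the summary. Dr. Chen approved the report."
)

tokenizer = train_punkt(corpus)

# tokenize() returns a list of sentence strings.
sentences = tokenizer.tokenize("I like tea. You like coffee.")
for sentence in sentences:
    print(sentence)
```

<div style>No annotated data is needed for this, which is the main attraction of the unsupervised approach; whether the learned parameters are any good depends entirely on the size and nature of the training text.</div>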
<div style><br></div><div style>[1] <a href="http://www.aclweb.org/anthology-new/R/R11/R11-1021.pdf">http://www.aclweb.org/anthology-new/R/R11/R11-1021.pdf</a></div><div class="gmail_extra">[2] <a href="http://www.linguistics.rub.de/~strunk/ks2005FINAL.pdf">http://www.linguistics.rub.de/~strunk/ks2005FINAL.pdf</a><br>
<br><div class="gmail_quote">On Sun, Apr 21, 2013 at 10:34 AM, Xin Ying Qiu <span dir="ltr">&lt;<a href="mailto:xinying.qiu@gmail.com" target="_blank">xinying.qiu@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div dir="ltr"><div><div><div><div>Hello,<br><br></div>I am processing Chinese reports that contain phrases used as titles and subtitles as well as sentences ending with a period. I want to extract the sentences that end with a period, but it is difficult to identify where such a sentence begins, because the documents also contain stand-alone phrases and numbers; they do not consist only of sentences ending with a period. Are there any tools available to detect, split, or extract Chinese sentences from a document? <br>
<br></div>I've tried the Stanford NLP document preprocessing tool, edu.stanford.nlp.process.DocumentPreprocessor, but it does not seem to work for my documents. <br><br></div>Thank you in advance for any advice and suggestions!<br>
<br></div>Sincerely,<br><br>Xin Ying<br><br></div>
Corpora mailing list<br>
<a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>
<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>
<br></blockquote></div><br></div></div>