[Corpora-List] Chinese sentence detector or splitter

Florian Petran florian.petran at gmail.com
Sun Apr 21 13:55:37 UTC 2013


As far as I know, most approaches designed specifically for Chinese sentence
segmentation focus on segmenting unpunctuated text. For example,
Huang/Chen (2011): Pause and Stop Labeling for Chinese Sentence Boundary
Detection [1]
might be what you're looking for, although the F-scores seem a bit low for
production use.
If you have a large enough corpus, you can always try an unsupervised,
multilingual algorithm such as
Kiss/Strunk (2006): Unsupervised Multilingual Sentence Boundary Detection
[2]
An implementation of it (the Punkt tokenizer) is bundled with NLTK. It has
good accuracy on English data, but I'm not sure how well it would work with
your material.
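
If the sentences you want always end with the ideographic full stop (。) and
titles or numbers sit on their own lines, a plain rule-based pass may already
get you most of the way before you try either of the above. Here is a minimal
sketch under exactly those assumptions; the function name extract_sentences
and the sample text are made up for illustration, and it is not taken from
either paper:

```python
import re

# One candidate sentence: a run of characters that are not the ideographic
# full stop, followed by exactly one full stop. Add other sentence-final
# marks to both places if they should count as well.
_SENTENCE_RE = re.compile(r"[^。]+。")


def extract_sentences(text):
    """Return every segment that ends with 。, scanning line by line.

    Assumption: titles, numbers and stand-alone phrases are on their own
    lines and contain no 。, so they produce no match and are skipped.
    Any text after the last 。 on a line is dropped as well.
    """
    sentences = []
    for line in text.splitlines():
        for match in _SENTENCE_RE.finditer(line.strip()):
            sentences.append(match.group().strip())
    return sentences


if __name__ == "__main__":
    sample = "年度报告\n一、概述\n本公司今年业绩良好。收入增长百分之十。\n2013\n详见附表"
    for sentence in extract_sentences(sample):
        print(sentence)
```

This will fail when a title and a sentence share a line, because the title
then ends up glued to the front of the sentence. That case is where a trained
model would have to take over.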

[1] http://www.aclweb.org/anthology-new/R/R11/R11-1021.pdf
[2] http://www.linguistics.rub.de/~strunk/ks2005FINAL.pdf

On Sun, Apr 21, 2013 at 10:34 AM, Xin Ying Qiu <xinying.qiu at gmail.com> wrote:

> Hello,
>
> I am processing Chinese reports which include phrases as title and
> subtitles as well as sentences ending with the period sign.  I want to
> extract the sentences ending with the period sign. But it is difficult to
> identify the beginning of such sentences as the document may contain
> stand-alone phrases and numbers.  It is not a document consisting of only
> sentences ending with period signs.  Are there any tools available to
> detect or split or extract Chinese sentence from a document?
>
> I've tried Stanford NLP document preprocess tool:
> edu.stanford.nlp.process.DocumentPreprocessor.  But it does not seem to
> work for my document.
>
> Thank you in advance for any advice and suggestions!
>
> Sincerely,
>
> Xin Ying
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>