[Corpora-List] Chinese sentence detector or splitter

Simon Smith smithsgj at gmail.com
Mon Apr 22 08:51:27 UTC 2013


Dear Xin Ying Qiu

This doesn't sound like it would be too hard to write a script for, or you
could just do it in Word... Why don't you post an extract from one of your
reports, with a few sentences you do want and the numbers/headings that you
don't want?

Seems like you could just do it in Word by substituting a ^p (paragraph
mark) for each of the 。 ? ! symbols?
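
If a script is preferred over Word, something along these lines might be a
starting point. It is only a rough sketch in Python: the function name
extract_sentences and the set of sentence-ending marks are my own
assumptions, and it simply treats any run of text on a line that ends in one
of those marks as a sentence. Lines with no such mark (headings, bare
numbers) are dropped, and so is any trailing text after the last mark on a
line.

```python
import re

# Assumed set of sentence-ending marks for this sketch: the ideographic
# full stop, the fullwidth question/exclamation marks, and ASCII ? and !.
SENTENCE_END = "。？！?!"

# One sentence = a run of non-terminator characters plus one terminator.
SENTENCE_RE = re.compile("[^%s]+[%s]" % (SENTENCE_END, SENTENCE_END))


def extract_sentences(text):
    """Return the terminated sentences found on each line of text.

    Lines (or line tails) without a terminator, such as headings and
    stand-alone numbers, are skipped.
    """
    sentences = []
    for line in text.splitlines():
        for match in SENTENCE_RE.findall(line.strip()):
            sentences.append(match.strip())
    return sentences


if __name__ == "__main__":
    sample = "一、概述\n2013\n公司销售额增长。利润下降！\n附表"
    for sentence in extract_sentences(sample):
        print(sentence)
```

Real reports will probably need more care than this, for example headings
that share a line with the first sentence, or periods inside quotation
marks.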

Sometimes the Chinese period is mid-line, sometimes at the bottom (like
English punctuation). I'm not sure how to control this or whether they are
different Unicode characters. But that could be why the program you were
using didn't find the periods?
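
One way to check would be to list the code points of the period-like
characters that actually occur in a file. The sketch below is only an
illustration (the function name period_codepoints is made up); it relies on
Python's standard unicodedata module and reports every character whose
Unicode name contains "FULL STOP", which covers U+3002 IDEOGRAPHIC FULL
STOP, U+FF0E FULLWIDTH FULL STOP, U+FF61 HALFWIDTH IDEOGRAPHIC FULL STOP and
the ordinary ASCII period.

```python
import unicodedata


def period_codepoints(text):
    """List the distinct full-stop-like characters in text.

    Returns (code point, Unicode name) pairs, ordered by code point,
    for every character whose Unicode name contains "FULL STOP".
    """
    found = set()
    for ch in text:
        name = unicodedata.name(ch, "")  # "" for unnamed characters
        if "FULL STOP" in name:
            found.add(ch)
    return [("U+%04X" % ord(ch), unicodedata.name(ch))
            for ch in sorted(found, key=ord)]


if __name__ == "__main__":
    for codepoint, name in period_codepoints("好。是．"):
        print(codepoint, name)
```

If this reports more than one code point for a report, the program being
used may only be looking for one of them.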

Simon


>    1.  Chinese sentence detector or splitter (Xin Ying Qiu)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sun, 21 Apr 2013 16:34:45 +0800
> From: Xin Ying Qiu <xinying.qiu at gmail.com>
> Subject: [Corpora-List] Chinese sentence detector or splitter
> To: Corpora at uib.no
>
> Hello,
>
> I am processing Chinese reports which include phrases as title and
> subtitles as well as sentences ending with the period sign.  I want to
> extract the sentences ending with the period sign. But it is difficult to
> identify the beginning of such sentences as the document may contain
> stand-alone phrases and numbers.  It is not a document consisting of only
> sentences ending with period signs.  Are there any tools available to
> detect or split or extract Chinese sentence from a document?
>