Incorrect word segmenting for Chinese characters

Toh An popsune1 at gmail.com
Tue Dec 13 09:20:10 UTC 2016



Hi, I have encountered a problem with Chinese data. CLAN does not appear to 
segment Chinese sentences into word tokens correctly, and part-of-speech 
tagging is affected as well. Attached are the CLAN output from running the 
mlu and freq commands on a test file without a %mor tier (TestFileOutput), 
and the same test file with a %mor tier added (TestFileMor). Does anyone 
have any ideas on how to resolve this? Thanks.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: TestFile.cha
Type: application/octet-stream
Size: 726 bytes
Desc: not available
URL: <http://listserv.linguistlist.org/pipermail/chibolts/attachments/20161213/b89dea31/attachment-0006.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: TestFileOutput.cha
Type: application/octet-stream
Size: 1264 bytes
Desc: not available
URL: <http://listserv.linguistlist.org/pipermail/chibolts/attachments/20161213/b89dea31/attachment-0007.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: TestFileMor.cha
Type: application/octet-stream
Size: 827 bytes
Desc: not available
URL: <http://listserv.linguistlist.org/pipermail/chibolts/attachments/20161213/b89dea31/attachment-0008.obj>

