[Corpora-List] Chasen and Japanese

Laurence Anthony anthony0122 at gmail.com
Tue Jun 28 16:42:57 UTC 2011


>
> A Japanese user of WordSmith needs help with the Chasen software, which I
> understand provides segmentation of the string of characters in Japanese.
> Desired output form would be UTF16 for WordSmith.
>
> Can anyone advise, please? Is this possible?
>
> Mike


Hi Mike,

I think ChaSen only outputs ANSI (Shift-JIS here in Japan) or UTF-8.
However, an alternative tool is MeCab, which does offer experimental UTF-16
support.
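
Another possible route is to let either tool write UTF-8 and re-encode the
result as UTF-16 afterwards. Below is a minimal Python sketch of that
conversion. It is only an illustration: the function name utf8_to_utf16 and
the file names are hypothetical, and it assumes WordSmith wants little-endian
UTF-16 with a byte order mark, which I have not verified.

```python
import codecs
from pathlib import Path


def utf8_to_utf16(src: Path, dst: Path) -> None:
    """Re-encode a UTF-8 text file as UTF-16 LE with a byte order mark.

    Assumption: the target program expects little-endian UTF-16 with a BOM.
    """
    # Read raw bytes so that line endings are passed through unchanged.
    # "utf-8-sig" drops a leading UTF-8 BOM if the segmenter wrote one.
    text = src.read_bytes().decode("utf-8-sig")
    # Write the BOM explicitly so the byte order does not depend on the machine.
    dst.write_bytes(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))


if __name__ == "__main__":
    # Hypothetical file names: segmenter output in, UTF-16 copy out.
    source = Path("segmented_utf8.txt")
    if source.exists():
        utf8_to_utf16(source, Path("segmented_utf16.txt"))
```

The segmented text itself is not altered; only the byte encoding changes, so
the word boundaries produced by the segmenter are preserved.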

You can read about it here (unfortunately everything is in Japanese):
http://mecab.sourceforge.net

Here's a summary of the latest version (dated 2009):
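2009-09-27 MeCab 0.98
UTF-16 support (experimental)
Windows version now uses native APIs such as MultiByteToWideChar for character encoding conversion
Source code changed to Google coding style
Added EON (end of N-best) to the format options (-S or --eon-format)
Fixed a problem with the handling of half-width katakana in Shift-JIS environments
Online learning support (experimental)
Can now be compiled without specifying -Wno-deprecated
Minor bug fixes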

Hope that helps!
Laurence.