[Corpora-List] Chasen and Japanese

Jeremy Kahn trochee at trochee.net
Tue Jun 28 16:57:03 UTC 2011


Why not write a conversion adapter from Shift-JIS to Unicode and put that
conversion in your data pipeline after Chasen? (Possibly with a
Unicode-to-Shift-JIS adapter upstream as well?)

Much of NLP work is plumbing; this is even a pretty easy piece of plumbing:
very little chewing-gum required!
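
A minimal sketch of such an adapter, in Python. It assumes the segmenter emits plain Shift-JIS and that the downstream tool wants little-endian UTF-16 with a byte-order mark; both are assumptions, not something confirmed in this thread. The function names are hypothetical, and the codec name may need to be "cp932" rather than "shift_jis" if the data uses the Windows variant of the encoding.

```python
import codecs
import sys


def sjis_to_utf16(data: bytes, src: str = "shift_jis") -> bytes:
    """Downstream adapter: Shift-JIS bytes in, UTF-16 LE bytes (with BOM) out."""
    text = data.decode(src)  # raises UnicodeDecodeError on malformed input
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def utf8_to_sjis(data: bytes, dst: str = "shift_jis") -> bytes:
    """Upstream adapter: UTF-8 bytes in, Shift-JIS bytes out."""
    # errors="replace" substitutes "?" for characters the target
    # encoding cannot represent, instead of raising an exception.
    return data.decode("utf-8").encode(dst, errors="replace")


def main() -> None:
    """Filter stdin to stdout so the adapter can sit in a shell pipeline."""
    sys.stdout.buffer.write(sjis_to_utf16(sys.stdin.buffer.read()))
```

Calling main() from a small script lets the adapter sit after the segmenter in a pipe; utf8_to_sjis covers the optional upstream direction. Reading all of stdin at once keeps the sketch short, though very large corpora might call for processing line by line.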

Jeremy
 On Jun 28, 2011 9:50 AM, "Laurence Anthony" <anthony0122 at gmail.com> wrote:
>>
>> A Japanese user of WordSmith needs help with the Chasen software, which I
>> understand provides segmentation of the string of characters in Japanese.
>> Desired output form would be UTF16 for WordSmith.
>>
>> Can anyone advise, please? Is this possible?
>>
>> Mike
>
>
> Hi Mike,
>
> I think Chasen only outputs to ANSI (SHIFT-JIS here in Japan) or UTF-8.
> However, an alternative tool is MeCab, which does offer tentative UTF-16
> support.
>
> You can read about it here (unfortunately everything is in Japanese):
> http://mecab.sourceforge.net
>
> Here's a summary of the latest version (dated 2009):
> 2009-09-27 MeCab 0.98
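> UTF-16 support (experimental)
> Changed the Windows version to use native APIs such as MultiByteToWideChar for character encoding conversion
> Changed the source code to Google coding style
> Added EON (end of N-best) to the format options (-S or --eon-format)
> Fixed a problem with the handling of half-width katakana in Shift-JIS environments
> Online learning support (experimental)
> Made it possible to compile without specifying -Wno-deprecated
> Minor bug fixes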
>
> Hope that helps!
> Laurence.