[Corpora-List] Chasen and Japanese

WHITELOCK, Pete pete.whitelock at oup.com
Wed Jun 29 13:21:59 UTC 2011


Can’t you just use iconv (under Cygwin if you’re in Windows)?

Pete Whitelock
Head of Language Engineering, Dictionaries
Reference Department
Academic Division
Oxford University Press
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Jeremy Kahn
Sent: 28 June 2011 17:57
To: Laurence Anthony
Cc: corpora at uib.no
Subject: Re: [Corpora-List] Chasen and Japanese


Why not write a conversion adapter from shift-JIS to Unicode and put that conversion in your data pipeline after chasen? (possibly with a unicode-to-shift-JIS adapter upstream as well?)
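
A minimal sketch of such an adapter in Python, assuming the segmenter emits plain Shift-JIS and the downstream tool wants UTF-16 with a byte-order mark. The function names (sjis_to_utf16, utf16_to_sjis, convert_stream) are made up for illustration and are not part of ChaSen or WordSmith. If the data contains Microsoft-specific extension characters, "cp932" may be the better codec name than "shift_jis".

```python
import sys


def sjis_to_utf16(data: bytes, errors: str = "strict") -> bytes:
    """Decode Shift-JIS bytes and re-encode them as UTF-16.

    Python's "utf-16" codec writes a byte-order mark followed by
    native-endian code units.
    """
    text = data.decode("shift_jis", errors=errors)
    return text.encode("utf-16")


def utf16_to_sjis(data: bytes, errors: str = "strict") -> bytes:
    """Upstream direction: UTF-16 (with BOM) back to Shift-JIS.

    Characters that Shift-JIS cannot represent raise UnicodeEncodeError
    unless a different `errors` policy (e.g. "replace") is passed.
    """
    text = data.decode("utf-16")
    return text.encode("shift_jis", errors=errors)


def convert_stream(src, dst) -> None:
    """Read all bytes from a binary stream, convert, write to another.

    Reads the whole input at once, which keeps multi-byte characters
    from being split across chunk boundaries; fine for modest files.
    """
    dst.write(sjis_to_utf16(src.read()))
```

To use it as a pipe stage after the segmenter, one would call convert_stream(sys.stdin.buffer, sys.stdout.buffer) from a small script and redirect the output to a file.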

Much of NLP work is plumbing; this is even a pretty easy piece of plumbing: very little chewing-gum required!

Jeremy
On Jun 28, 2011 9:50 AM, "Laurence Anthony" <anthony0122 at gmail.com> wrote:
>>
>> A Japanese user of WordSmith needs help with the Chasen software, which I
>> understand provides segmentation of the string of characters in Japanese.
>> Desired output form would be UTF16 for WordSmith.
>>
>> Can anyone advise, please? Is this possible?
>>
>> Mike
>
>
> Hi Mike,
>
> I think Chasen only outputs to ANSI (SHIFT-JIS here in Japan) or UTF-8.
> However, an alternative tool is MeCab, which does offer tentative UTF-16
> support.
>
> You can read about it here (unfortunately everything is in Japanese):
> http://mecab.sourceforge.net
>
> Here's a summary of the latest version (dated 2009):
> 2009-09-27 MeCab 0.98
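> UTF-16 support (experimental)
> Windows version now uses native APIs such as MultiByteToWideChar for character encoding conversion
> Source code changed to Google coding style
> Added EON (end of N-best) to the format options (-S or --eon-format)
> Fixed a problem with half-width katakana handling in Shift-JIS environments
> Online learning support (experimental)
> Can now be compiled without -Wno-deprecated
> Minor bug fixes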
>
> Hope that helps!
> Laurence.
