[Corpora-List] Chasen and Japanese

Mike Scott mike at lexically.net
Tue Jun 28 17:01:38 UTC 2011


Depends who's doing the plumbing, really. In WordSmith there is a
converter which will do that already, but Chasen is a different piece of
kit and I think you need a Japanese plumber for that!
Mike

On 28/06/2011 17:57, Jeremy Kahn wrote:
>
> Why not write a conversion adapter from shift-JIS to Unicode and put
> that conversion in your data pipeline after chasen? (possibly with a
> unicode-to-shift-JIS adapter upstream as well?)
>
> Much of NLP work is plumbing; this is even a pretty easy piece of
> plumbing: very little chewing-gum required!
>
> Jeremy
>
> On Jun 28, 2011 9:50 AM, "Laurence Anthony" <anthony0122 at gmail.com> wrote:
> >>
> >> A Japanese user of WordSmith needs help with the Chasen software, which I
> >> understand provides segmentation of the string of characters in Japanese.
> >> Desired output form would be UTF16 for WordSmith.
> >>
> >> Can anyone advise, please? Is this possible?
> >>
> >> Mike
> >
> >
> > Hi Mike,
> >
> > I think Chasen only outputs to ANSI (SHIFT-JIS here in Japan) or UTF-8.
> > However, an alternative tool is MeCab, which does offer tentative UTF-16
> > support.
> >
> > You can read about it here (unfortunately everything is in Japanese):
> > http://mecab.sourceforge.net
> >
> > Here's a summary of the latest version (dated 2009):
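> > 2009-09-27 MeCab 0.98
> > UTF16 support (experimental)
> > Changed the Windows version to use native APIs such as MultiByteToWideChar for character encoding conversion
> > Changed the source code to Google coding style
> > Added EON (end of N-best) to the format options (-S or --eon-format)
> > Fixed a problem with the handling of half-width katakana in Shift-JIS environments
> > Support for online learning (experimental)
> > Made it possible to compile without specifying Wno-deprecated
> > Minor bug fixes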
> >
> > Hope that helps!
> > Laurence.
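
The adapter Jeremy suggests above (Shift-JIS in, Unicode out, with an optional reverse step upstream) might look roughly like the Python sketch below. It is only an illustration under assumptions: the function names are hypothetical, the output is assumed to be wanted as little-endian UTF-16 with a byte-order mark (nothing in this thread confirms what WordSmith expects), and Python's "cp932" codec may match Windows "ANSI" output more closely than "shift_jis".

```python
import codecs


def sjis_to_utf16(data: bytes) -> bytes:
    """Re-encode Shift-JIS bytes as UTF-16 little-endian with a BOM.

    Hypothetical helper for the step after the segmenter. Swap "shift_jis"
    for "cp932" if the input uses Microsoft's extensions.
    """
    text = data.decode("shift_jis")
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def utf16_to_sjis(data: bytes) -> bytes:
    """Re-encode UTF-16 bytes as Shift-JIS (the optional upstream step).

    Raises UnicodeEncodeError for characters Shift-JIS cannot represent.
    """
    text = data.decode("utf-16")  # honours a leading BOM if there is one
    return text.encode("shift_jis")


def convert_file(src_path: str, dst_path: str) -> None:
    """Read a Shift-JIS file and write a UTF-16 copy of it."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(sjis_to_utf16(data))
```

Something like convert_file("segmented_sjis.txt", "segmented_utf16.txt") would then sit between the segmenter and WordSmith; both file names are placeholders.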

-- 
Mike Scott

***
If you publish research which uses WordSmith, do let me know so I can include it at
http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm
***
University of Aston and Lexical Analysis Software Ltd.
mike.scott at aston.ac.uk
www.lexically.net
