<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-2022-JP"
http-equiv="Content-Type">
<title></title>
</head>
<body bgcolor="#ffffff" text="#000000">
Depends who's doing the plumbing, really. In WordSmith there is a
converter which will do that already, but Chasen is a different
piece of kit and I think you need a Japanese plumber for that!<br>
Mike<br>
<br>
On 28/06/2011 17:57, Jeremy Kahn wrote:
<blockquote
cite="mid:BANLkTinhW_COiu6wowgx4L6bJeBkk4UknA@mail.gmail.com"
type="cite">
<p>Why not write a conversion adapter from Shift-JIS to Unicode
and put that conversion in your data pipeline after Chasen
(possibly with a Unicode-to-Shift-JIS adapter upstream as well)?</p>
<p>Much of NLP work is plumbing; this is even a pretty easy piece
of plumbing: very little chewing-gum required!</p>
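<p>As a rough illustration, the adapter suggested above could be
sketched in Python along the following lines. This is only a sketch
under stated assumptions: that Chasen's output arrives as Shift-JIS
bytes and that UTF-16 with a byte-order mark is acceptable to
WordSmith. The function name transcode and its parameters are
hypothetical and not part of either tool.</p>
<pre>
```python
import io


def transcode(src, dst, src_encoding="shift_jis", dst_encoding="utf-16"):
    """Copy text from binary stream src to binary stream dst, re-encoding it.

    Assumption: src holds Shift-JIS text (e.g. segmenter output) and the
    consumer accepts UTF-16 with a byte-order mark.
    """
    # newline="" leaves line endings exactly as they appear in the input.
    reader = io.TextIOWrapper(src, encoding=src_encoding, newline="")
    writer = io.TextIOWrapper(dst, encoding=dst_encoding, newline="")
    for line in reader:
        writer.write(line)
    writer.flush()
    # Detach the wrappers so the caller's underlying streams stay open.
    reader.detach()
    writer.detach()
```
</pre>
<p>Placed after the segmenter in a pipeline, it might be called as
<code>transcode(sys.stdin.buffer, sys.stdout.buffer)</code>; swapping
the two encoding arguments would give the upstream
Unicode-to-Shift-JIS direction.</p>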
<p>Jeremy<br>
</p>
<div class="gmail_quote">On Jun 28, 2011 9:50 AM, "Laurence
Anthony" &lt;<a moz-do-not-send="true"
href="mailto:anthony0122@gmail.com">anthony0122@gmail.com</a>&gt;
wrote:<br type="attribution">
>><br>
>> A Japanese user of WordSmith needs help with the Chasen
software, which I<br>
>> understand provides segmentation of the string of
characters in Japanese.<br>
>> The desired output format would be UTF-16 for WordSmith.<br>
>><br>
>> Can anyone advise, please? Is this possible?<br>
>><br>
>> Mike<br>
> <br>
> <br>
> Hi Mike,<br>
> <br>
> I think Chasen only outputs ANSI (Shift-JIS here in
Japan) or UTF-8.<br>
> However, an alternative tool is MeCab, which does offer
experimental UTF-16<br>
> support.<br>
> <br>
> You can read about it here (unfortunately everything is in
Japanese):<br>
> <a moz-do-not-send="true"
href="http://mecab.sourceforge.net">http://mecab.sourceforge.net</a><br>
> <br>
> Here's a summary of the latest version (dated 2009):<br>
> 2009-09-27 MeCab 0.98<br>
> Support for UTF-16 (experimental)<br>
> On Windows, character encoding conversion now uses native APIs such as MultiByteToWideChar<br>
> Source code changed to Google coding style<br>
> Added EON (end of N-best) to the format options (-S or --eon-format)<br>
> Fixed a problem with the handling of half-width katakana in Shift-JIS environments<br>
> Added support for online learning (experimental)<br>
> Now compiles without -Wno-deprecated<br>
> Minor bug fixes<br>
> <br>
> Hope that helps!<br>
> Laurence.<br>
</div>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Mike Scott
***
If you publish research which uses WordSmith, do let me know so I can include it at
<a class="moz-txt-link-freetext" href="http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm">http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm</a>
***
University of Aston and Lexical Analysis Software Ltd.
<a class="moz-txt-link-abbreviated" href="mailto:mike.scott@aston.ac.uk">mike.scott@aston.ac.uk</a>
<a class="moz-txt-link-abbreviated" href="http://www.lexically.net">www.lexically.net</a>
</pre>
</body>
</html>