<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=iso-2022-jp"><meta name=Generator content="Microsoft Word 14 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:Tahoma;
        panose-1:2 11 6 4 3 5 4 4 2 4;}
@font-face
        {font-family:"MS PGothic";
        panose-1:2 11 6 0 7 2 5 8 2 4;}
@font-face
        {font-family:"\@MS PGothic";
        panose-1:2 11 6 0 7 2 5 8 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0cm;
        margin-bottom:.0001pt;
        font-size:12.0pt;
        font-family:"MS PGothic","sans-serif";
        mso-fareast-language:JA;}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;}
p
        {mso-style-priority:99;
        mso-margin-top-alt:auto;
        margin-right:0cm;
        mso-margin-bottom-alt:auto;
        margin-left:0cm;
        font-size:12.0pt;
        font-family:"MS PGothic","sans-serif";
        mso-fareast-language:JA;}
span.EmailStyle18
        {mso-style-type:personal-reply;
        font-family:"Calibri","sans-serif";
        color:#1F497D;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-family:"Calibri","sans-serif";
        mso-fareast-language:EN-US;}
@page WordSection1
        {size:612.0pt 792.0pt;
        margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
        {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=EN-GB link=blue vlink=purple><div class=WordSection1><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Can't you just use iconv (under Cygwin if you're in Windows)?<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span style='font-size:10.0pt;font-family:"Courier New";color:#1F497D;mso-fareast-language:EN-GB'>Pete Whitelock</span><span style='font-family:"Times New Roman","serif";color:#1F497D;mso-fareast-language:EN-GB'> <br></span><span style='font-size:10.0pt;font-family:"Courier New";color:#1F497D;mso-fareast-language:EN-GB'>Head of Language Engineering, Dictionaries</span><span style='font-family:"Times New Roman","serif";color:#1F497D;mso-fareast-language:EN-GB'> <br></span><span style='font-size:10.0pt;font-family:"Courier New";color:#1F497D;mso-fareast-language:EN-GB'>Reference Department</span><span style='font-family:"Times New Roman","serif";color:#1F497D;mso-fareast-language:EN-GB'> <br></span><span style='font-size:10.0pt;font-family:"Courier New";color:#1F497D;mso-fareast-language:EN-GB'>Academic Division</span><span style='font-family:"Times New Roman","serif";color:#1F497D;mso-fareast-language:EN-GB'> <br></span><span style='font-size:10.0pt;font-family:"Courier New";color:#1F497D;mso-fareast-language:EN-GB'>Oxford University Press</span><span style='font-family:"Times New Roman","serif";color:#1F497D;mso-fareast-language:EN-GB'> </span><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p></o:p></span></p><p class=MsoNormal><b><span lang=EN-US style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span></b><span lang=EN-US style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'> corpora-bounces@uib.no [mailto:corpora-bounces@uib.no] <b>On Behalf Of </b>Jeremy Kahn<br><b>Sent:</b> 28 
June 2011 17:57<br><b>To:</b> Laurence Anthony<br><b>Cc:</b> corpora@uib.no<br><b>Subject:</b> Re: [Corpora-List] Chasen and Japanese<o:p></o:p></span></p><p class=MsoNormal><o:p> </o:p></p><p>Why not write a conversion adapter from Shift-JIS to Unicode and put it in your data pipeline after Chasen (possibly with a Unicode-to-Shift-JIS adapter upstream as well)?<o:p></o:p></p><p>Much of NLP work is plumbing, and this is a pretty easy piece of plumbing: very little chewing gum required!<o:p></o:p></p><p>Jeremy<o:p></o:p></p><div><p class=MsoNormal>On Jun 28, 2011 9:50 AM, "Laurence Anthony" &lt;<a href="mailto:anthony0122@gmail.com">anthony0122@gmail.com</a>&gt; wrote:<br>>><br>>> A Japanese user of WordSmith needs help with the Chasen software, which I<br>>> understand provides segmentation of the string of characters in Japanese.<br>>> Desired output form would be UTF-16 for WordSmith.<br>>><br>>> Can anyone advise, please? Is this possible?<br>>><br>>> Mike<br>> <br>> <br>> Hi Mike,<br>> <br>> I think Chasen only outputs to ANSI (SHIFT-JIS here in Japan) or UTF-8.<br>> However, an alternative tool is MeCab, which does offer tentative UTF-16<br>> support.<br>> <br>> You can read about it here (unfortunately everything is in Japanese):<br>> <a href="http://mecab.sourceforge.net">http://mecab.sourceforge.net</a><br>> <br>> Here's a summary of the latest version (dated 2009):<br>> 2009-09-27 MeCab 0.98<br>> UTF-16 support (experimental)<br>> Windows version: character encoding conversion now uses Native APIs such as MultiByteToWideChar<br>> Source code changed to Google coding style<br>> Added EON (end of N-best) to the format specification (-S or --eon-format)<br>> Fixed a problem with the handling of half-width katakana in Shift-JIS environments<br>> online learning support (experimental)<br>> Now compiles without Wno-deprecated<br>> Minor bug fixes<br>> <br>> Hope that helps!<br>> Laurence.<o:p></o:p></p></div></div>
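<p>The conversion adapter suggested in the thread could look something like the following. This is only a minimal sketch in Python, not anything posted by the participants: the function name <code>transcode</code>, its parameters, and the default chunk size are assumptions made for illustration, and it presumes the segmenter's output really is Shift-JIS. The same conversion can normally be done with iconv, using its <code>-f</code> (from) and <code>-t</code> (to) options to name the two encodings.</p>

```python
import codecs


def transcode(src, dst, src_encoding="shift_jis", dst_encoding="utf-16",
              chunk_size=65536):
    """Copy bytes from src to dst, re-encoding the text on the way.

    src and dst are binary file-like objects (hypothetical adapter API).
    Incremental codecs are used so that a multi-byte Shift-JIS character
    split across two chunks is still decoded correctly.
    """
    decoder = codecs.getincrementaldecoder(src_encoding)()
    encoder = codecs.getincrementalencoder(dst_encoding)()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(encoder.encode(decoder.decode(chunk)))
    # Flush both codecs; this raises UnicodeDecodeError if the input
    # ended in the middle of a multi-byte character.
    tail = decoder.decode(b"", final=True)
    dst.write(encoder.encode(tail, final=True))
```

<p>To use it as a pipeline stage after the segmenter, one would call <code>transcode(sys.stdin.buffer, sys.stdout.buffer)</code> from a small script. Python's <code>utf-16</code> codec writes a byte-order mark at the start of the output; whether WordSmith expects one is not stated in the thread, so that is worth checking.</p>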
</body></html>