[Corpora-List] Chasen and Japanese

Wed Jun 29 13:52:43 UTC 2011

On 28 June 2011 08:57, Mike Scott <mike at lexically.net> wrote:
> A Japanese user of WordSmith needs help with the Chasen software, which I
> understand provides segmentation of the string of characters in Japanese.
> Desired output form would be UTF16 for WordSmith.

Note that NLTK includes support for ChaSen (a corpus reader for
ChaSen-format output), and that Python supports encoding and decoding
of UTF-8 and UTF-16, so a small Python program could do the job of
running ChaSen and saving the results in UTF-16.
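
A minimal sketch of such a program follows. It rests on several
assumptions that are not from the original message: that a `chasen`
executable is on the PATH, that it accepts `-i w` to select UTF-8
input and output (this depends on the installed build and dictionary),
and that it prints one morpheme per line with tab-separated fields and
an `EOS` line after each sentence. The function names are
hypothetical and chosen only for illustration.

```python
import subprocess


def run_chasen(text, command=("chasen", "-i", "w"), io_encoding="utf-8"):
    """Pipe text through an external ChaSen process and return its output.

    ASSUMPTION: the command line and encoding must match the local
    ChaSen installation; the defaults here are only a guess.
    """
    result = subprocess.run(
        list(command),
        input=text.encode(io_encoding),
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode(io_encoding)


def surface_forms(chasen_output):
    """Turn one-morpheme-per-line output into space-separated sentences.

    ASSUMPTION: the surface form is the first tab-separated field and
    each sentence ends with a line containing only "EOS".
    """
    sentences, current = [], []
    for line in chasen_output.splitlines():
        if line.strip() == "EOS":
            sentences.append(" ".join(current))
            current = []
        elif line.strip():
            current.append(line.split("\t")[0])
    if current:  # output that did not end with EOS
        sentences.append(" ".join(current))
    return sentences


def save_utf16(sentences, path):
    """Write one sentence per line; the utf-16 codec adds a byte-order mark."""
    with open(path, "w", encoding="utf-16", newline="\n") as out:
        for sentence in sentences:
            out.write(sentence + "\n")
```

Chaining the three steps, for example
save_utf16(surface_forms(run_chasen(text)), "segmented.txt"), would
give a space-delimited UTF-16 file. Whether WordSmith needs a
particular byte order or line ending is not covered here and should be
checked against its documentation.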

The Japanese translation of the NLTK book covers Unicode in chapter 3,
and ChaSen in an additional chapter on Japanese language processing at
the end of the book.

http://www.oreilly.co.jp/books/9784873114705/

-Steven Bird
