[Corpora-List] How to do Japanese word segmentation using extra term list?

Michal Ptaszynski michal.ptaszynski at gmail.com
Tue Oct 25 16:36:51 UTC 2011


Dear Adam, Dear Hongfei Jiang

Here is a page with a short MeCab manual for Linux users.

http://linux.die.net/man/1/mecab

If you just need to add spaces (is this what you mean by segmentation?) to
a document in Japanese, just type:
mecab -Owakati input_file.txt > output_file.txt

This will give you a space-separated version of your file using the
default system dictionary (usually IPADIC).

You can also use a custom dictionary built from your extra term list (see
the -u / --userdic option on the page above).
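
As a rough starting point for the custom dictionary, the sketch below turns
a plain term list into the CSV that MeCab's mecab-dict-index tool compiles
into a user dictionary. It assumes the 13-column IPADIC layout (surface,
left ID, right ID, cost, part-of-speech and conjugation fields, base form,
reading, pronunciation). The function name terms_to_mecab_csv, the cost of
1000 and the blanket "proper noun" tag are my own illustrative choices, not
something taken from the manual page. Leaving the two context-ID columns
empty relies on mecab-dict-index assigning them itself, which the MeCab
documentation describes for recent versions; if yours does not, copy the IDs
from a similar entry in the system dictionary.

```python
import csv
import io

# Assumed default: a lower cost makes MeCab more likely to choose the term.
# Tune this if your terms end up over- or under-segmented.
DEFAULT_COST = 1000


def terms_to_mecab_csv(terms, cost=DEFAULT_COST):
    """Return IPADIC-style user-dictionary CSV text for an iterable of terms."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for term in terms:
        term = term.strip()
        if not term:
            continue  # skip blank lines in the term list
        writer.writerow([
            term,                          # surface form
            "", "",                        # left/right context IDs, left empty on purpose
            cost,                          # word cost
            "名詞", "固有名詞", "一般", "*",  # POS: noun, proper noun, general
            "*", "*",                      # conjugation type and form (none for nouns)
            term,                          # base form
            "*", "*",                      # reading and pronunciation (unknown here)
        ])
    return buf.getvalue()


# Example: one term in, one CSV row out.
print(terms_to_mecab_csv(["形態素解析器"]), end="")
```

Once the output is saved as, say, terms.csv, it can be compiled with
something like `mecab-dict-index -d /path/to/ipadic -u user.dic -f utf-8 -t utf-8 terms.csv`
(the tool is often installed under a libexec directory rather than on your
PATH) and then used with `mecab -Owakati -u user.dic input_file.txt`.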

Best,
--
Michal PTASZYNSKI
Institute of Engineering, Hokkai-Gakuen University
High-Tech Research Center, Intelligent Techniques Laboratory 6,
Minami 26, Nishi 11, Chuo-ku, Sapporo, 064-0926, Japan
ptaszynski at hgu.jp, ptaszynski at ieee.org
TEL: +81-11-841-1161 (ext.: 7796), FAX: +81-11-551-2951
http://arakilab.media.eng.hokudai.ac.jp/~ptaszynski/

--------------
From: Adam Kilgarriff <adam at lexmasterclass.com>
Cc: corpora <corpora at uib.no>, Hiram Calvo <hiramcalvo at gmail.com>,
Jan Pomikálek <xpomikal at fi.muni.cz>
To: "hf.jiang" <hf.jiang at gmail.com>
Date: Thu, 20 Oct 2011 08:22:02 +0100
Subject: Re: [Corpora-List] How to do Japanese word segmentation using extra
term list?

> However, since almost all of the user manual is in Japanese, I cannot
> understand it completely.

We have the same problem. Are there any English versions anywhere
(especially for MeCab)? Pointers and advice appreciated.

Adam

On 20 October 2011 08:08, hf.jiang <hf.jiang at gmail.com> wrote:
Hi all,

I'm currently trying to process Japanese texts.
Some friends suggested that I use ChaSen or MeCab.
However, since almost all of the user manual is in Japanese, I cannot
understand it completely.
What I need is a segmentation tool that gives preference to the words in
my term list.

Note that I do not have enough gold data to train the tools, so an
off-the-shelf tool would suit me better.

Looking forward to your reply, thanks.

-Hongfei Jiang

_______________________________________________
    UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
    Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora




-- 
========================================
Adam Kilgarriff                  adam at lexmasterclass.com
Director                         Lexical Computing Ltd
Visiting Research Fellow         University of Leeds
Corpora for all with the Sketch Engine
DANTE: a lexical database for English
========================================
