[Corpora-List] How to do Japanese word segmentation using extra term list?
Michal Ptaszynski
ptaszynski at media.eng.hokudai.ac.jp
Tue Oct 25 15:27:14 UTC 2011
Dear Adam, dear Hongfei Jiang,
Here is a page with a short MeCab manual for Linux users:
http://linux.die.net/man/1/mecab
If you just need to insert spaces between the words of a Japanese
document (is this what you mean by segmentation?), just type:
mecab -Owakati input_file.txt > output_file.txt
This will give you a space-separated version of your file, segmented with
the default dictionary (usually ipadic).
You can also use a custom dictionary with your extra term list (check the
page above).
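For the custom dictionary, a rough sketch of the steps might look like the following. Treat it as an illustration under assumptions rather than a tested recipe: the file names terms.txt, user.csv and user.dic are hypothetical, the helper terms_to_csv is my own, and the context IDs (1288) and cost (5000) are placeholder values that should be checked against the left-id.def and right-id.def files of your installed ipadic. In MeCab a lower cost makes an entry more likely to be chosen.

```shell
#!/bin/sh
# Sketch only. Hypothetical files: terms.txt (one term per line, no commas),
# user.csv (generated entries), user.dic (compiled user dictionary).

# Convert a term list on stdin into ipadic-style CSV rows on stdout.
# Columns: surface, left-id, right-id, cost, four part-of-speech fields,
# conjugation type, conjugation form, base form, reading, pronunciation.
# ASSUMPTION: 1288/1288 and 5000 are placeholders, not verified values.
# Reading and pronunciation are left as "*"; fill them in if you need them.
terms_to_csv() {
    awk 'NF { printf "%s,1288,1288,5000,名詞,固有名詞,一般,*,*,*,%s,*,*\n", $0, $0 }'
}

# Example run; skipped when MeCab or the input files are not available.
if [ -f terms.txt ] && [ -f input_file.txt ] \
   && command -v mecab >/dev/null 2>&1 \
   && command -v mecab-config >/dev/null 2>&1; then
    terms_to_csv < terms.txt > user.csv

    # mecab-dict-index is usually installed outside PATH, in the libexec dir.
    INDEXER="$(mecab-config --libexecdir)/mecab-dict-index"
    # ASSUMPTION: the system dictionary is ipadic, stored in UTF-8.
    DICDIR="$(mecab-config --dicdir)/ipadic"

    # Compile the CSV into a binary user dictionary.
    "$INDEXER" -d "$DICDIR" -u user.dic -f utf-8 -t utf-8 user.csv

    # Segment with the system dictionary plus the user dictionary.
    mecab -Owakati -u user.dic input_file.txt > output_file.txt
fi
```

If a term from your list is still split, lowering its cost in user.csv and recompiling is the usual thing to try. The -t value has to match the character set of the installed system dictionary.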
Best,
--
Michal PTASZYNSKI
Institute of Engineering, Hokkai-Gakuen University
High-Tech Research Center, Intelligent Techniques Laboratory 6,
Minami 26, Nishi 11, Chuo-ku, Sapporo, 064-0926, Japan
ptaszynski at hgu.jp, ptaszynski at ieee.org
TEL: +81-11-841-1161 (ext.: 7796), FAX: +81-11-551-2951
http://arakilab.media.eng.hokudai.ac.jp/~ptaszynski/
--------------
From: Adam Kilgarriff <adam at lexmasterclass.com>
Cc: corpora <corpora at uib.no>, Hiram Calvo <hiramcalvo at gmail.com>,
Jan Pomikálek <xpomikal at fi.muni.cz>
To: "hf.jiang" <hf.jiang at gmail.com>
Date: Thu, 20 Oct 2011 08:22:02 +0100
Subject: Re: [Corpora-List] How to do Japanese word segmentation using extra
term list?
> However, since almost all of the user manual is in Japanese, I cannot
> understand it completely.
We have the same problem. Are there any English versions anywhere
(especially for MeCab)? Pointers and advice appreciated.
Adam
On 20 October 2011 08:08, hf.jiang <hf.jiang at gmail.com> wrote:
Hi all,
I'm currently trying to process Japanese texts.
Some friends suggested that I use ChaSen or MeCab.
However, since almost all of the user manual is in Japanese, I cannot
understand it completely.
I would like the segmentation tool to give preference to the words
in my term list.
Note that I do not have enough gold-standard data to train the tools,
so an off-the-shelf tool would be better for me.
Looking forward to your reply, thanks.
-Hongfei Jiang
_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
--
========================================
Adam Kilgarriff adam at lexmasterclass.com
Director Lexical Computing Ltd
Visiting Research Fellow University of Leeds
Corpora for all with the Sketch Engine
DANTE: a lexical database for English
========================================