[Corpora-List] How to do Japanese word segmentation using extra term list?

Pierre Marchal pierre.inalco at gmail.com
Thu Oct 20 10:41:33 UTC 2011


Hi,

Your best bet is to build a user dictionary containing your terms.
Just proceed as follows:

- create a foo.csv file, with one entry per line (no blank lines)
e.g.:  自然言語処理,-1,-1,10,名詞,一般,*,*,*,*,自然言語処理,シゼンゲンゴショリ,シゼンゲンゴショリ
(note that you can append extra fields such as a translation, comments, and
so on)
(the value '10' is the cost of your entry: the lower it is, the more readily
the entry is selected when parsing text)
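
If your terms are already in a list, the CSV can be generated rather than
typed by hand. What follows is only a sketch: terms_to_csv is a made-up
helper name, the tab-separated input (surface form, then katakana reading)
is an assumed format, and every term gets the same part of speech
(名詞,一般) and default cost as the example entry above. Terms containing a
comma are not handled here.

```shell
# Hypothetical helper: reads "surface<TAB>reading" pairs on stdin and prints
# one dictionary entry per pair, in the same layout as the example entry.
# Optional first argument: the cost (defaults to 10, as in the example).
terms_to_csv() {
  cost="${1:-10}"
  tab="$(printf '\t')"
  while IFS="$tab" read -r surface reading || [ -n "$surface" ]; do
    # Skip blank lines: the CSV must not contain any.
    [ -n "$surface" ] || continue
    printf '%s,-1,-1,%s,名詞,一般,*,*,*,*,%s,%s,%s\n' \
      "$surface" "$cost" "$surface" "$reading" "$reading"
  done
}

# Example: two terms (with a stray blank line between them) become two entries.
printf '自然言語処理\tシゼンゲンゴショリ\n\n形態素解析\tケイタイソカイセキ\n' \
  | terms_to_csv > foo.csv
```

A lower cost can be passed as the first argument (for instance
terms_to_csv 5) if a term is still being split after compilation.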

- compile the dictionary. On my computer the command is:
/usr/lib/mecab/mecab-dict-index -d /usr/share/mecab/dic/naist-jdic-eucjp/ -u foo.dic -f utf-8 -t utf-8 foo.csv
You must provide the directory of an existing dictionary (-d) and your .csv
file. The other arguments are: the dictionary file to be created (-u
foo.dic), the encoding of your .csv file (-f utf-8), and the encoding of the
dictionary (-t utf-8).
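
The compile step can be wrapped so that the usual mistakes (a missing file,
a blank line in the CSV) are caught before mecab-dict-index runs. This is a
sketch under assumptions: build_user_dic and the two variables are
illustrative names, and the default paths are simply the ones from my
machine quoted above, so adjust them to your installation.

```shell
# Paths as in the command above; override them via the environment if yours differ.
MECAB_DICT_INDEX="${MECAB_DICT_INDEX:-/usr/lib/mecab/mecab-dict-index}"
SYSDIC_DIR="${SYSDIC_DIR:-/usr/share/mecab/dic/naist-jdic-eucjp/}"

# Hypothetical helper: compile the CSV file $1 into the user dictionary $2.
build_user_dic() {
  csv="$1"
  dic="$2"
  if [ ! -f "$csv" ]; then
    echo "missing CSV file: $csv" >&2
    return 1
  fi
  # The CSV must have one entry per line and no blank lines.
  if grep -q '^[[:space:]]*$' "$csv"; then
    echo "blank line found in $csv" >&2
    return 1
  fi
  if [ ! -x "$MECAB_DICT_INDEX" ]; then
    echo "mecab-dict-index not found at $MECAB_DICT_INDEX" >&2
    return 1
  fi
  # Same flags as the command above.
  "$MECAB_DICT_INDEX" -d "$SYSDIC_DIR" -u "$dic" -f utf-8 -t utf-8 "$csv"
}

# Usage: build_user_dic foo.csv foo.dic
```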

- run mecab, passing the compiled dictionary with -u:
mecab -u foo.dic
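
To check that a term now comes out as a single token, you can look for it
in the first column of mecab's output (by default one token per line, the
surface form followed by a tab and the features, then a line reading EOS).
A small sketch, where has_token is a made-up helper name:

```shell
# Hypothetical helper: succeeds if the term $1 appears as one whole token
# in MeCab's default output read from stdin.
has_token() {
  awk -F '\t' -v term="$1" '
    $1 == term { found = 1 }
    END { exit found ? 0 : 1 }
  '
}

# Usage (requires mecab and the compiled foo.dic):
#   echo '自然言語処理の研究' | mecab -u foo.dic | has_token '自然言語処理' \
#     && echo "recognized as one token"
```

If the check fails, lowering the cost of the entry in foo.csv and
recompiling is the first thing to try.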

Best of luck,

pm


PS:
If you have further questions, you can contact me directly
(pierre[dot]marchal[at]inalco[dot]fr).

-- 
Pierre Marchal
ERTIM - INaLCO
49 bis avenue de la Belle Gabrielle
F-75012 PARIS
+33 1 80 51 95 21