[Corpora-List] Identifying words in Japanese

Tue Jun 17 15:18:01 UTC 2003

In Japanese, words are often written in a mixture of two scripts: kanji
(logographs) and hiragana (a syllabary).  For example, where upper-case
letters represent kanji, lower-case letters represent hiragana, and a
space marks a character boundary, you might find the following word:

HI k KO shi

Unfortunately, anything that's written in kanji can alternatively be
written using hiragana.

hi k ko shi

To complicate matters further, the hiragana that follow a kanji
(okurigana) are sometimes omitted and left for the reader to supply.

HIK KOSHI
HI k KOSHI
HIK KO shi
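
The variants shown so far can be generated mechanically. The sketch
below is only an illustration in the notation used above, and it rests
on two assumptions: each kanji independently either shows its okurigana
or absorbs it, and the whole word may instead be written in hiragana
alone. The names COMPONENTS and spellings are hypothetical.

```python
from itertools import product

# Hypothetical description of the example word: each component is
# (kanji, hiragana reading, trailing okurigana). Upper case stands for
# kanji and lower case for hiragana, as in the notation of this post.
COMPONENTS = [
    ("HI", "hi", "k"),
    ("KO", "ko", "shi"),
]


def spellings(components):
    """Return every spelling allowed by the two assumptions in the comments."""
    variants = []
    # Assumption 1: each kanji either shows its okurigana or absorbs it.
    for choice in product([True, False], repeat=len(components)):
        parts = []
        for (kanji, _reading, okurigana), shown in zip(components, choice):
            if shown:
                parts.append(f"{kanji} {okurigana}")
            else:
                # Okurigana omitted: the kanji alone carries the full reading.
                parts.append(kanji + okurigana.upper())
        variants.append(" ".join(parts))
    # Assumption 2: the whole word may also be written in hiragana only.
    variants.append(" ".join(f"{r} {o}" for _kanji, r, o in components))
    return variants


for form in spellings(COMPONENTS):
    print(form)
```

With two components this gives 2 x 2 kanji spellings plus the
all-hiragana one.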

Thus, a word like this can be written five different ways. Given all
this, how would one go about doing a word-frequency count in Japanese?
One option is to standardize everything to hiragana (doable). The
problem with this is that you then end up with a high percentage of
homographs (distinct words that would be spelled differently, were they
written in kanji).
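
To make the trade-off concrete, here is a minimal sketch of counting by
hiragana reading. It assumes the text has already been split into words
and that a reading is available for every written form. The READINGS
table and the count_by_reading function are hypothetical and built by
hand for this example. In practice the readings would have to come from
a dictionary or a morphological analyzer.

```python
from collections import Counter

# Hypothetical lookup from written form to hiragana reading. This
# hand-built table only covers the tokens used below.
READINGS = {
    "引っ越し": "ひっこし",
    "引越し": "ひっこし",
    "引っ越": "ひっこし",
    "引越": "ひっこし",
    "ひっこし": "ひっこし",
    "橋": "はし",  # bridge
    "箸": "はし",  # chopsticks
}


def count_by_reading(tokens):
    """Count tokens after mapping each written form to its hiragana reading."""
    # Forms missing from the table are counted under their own spelling.
    return Counter(READINGS.get(token, token) for token in tokens)


# Assumed to be pre-segmented: Japanese text has no spaces between words.
tokens = ["引っ越し", "引越", "ひっこし", "橋", "箸", "箸"]
counts = count_by_reading(tokens)
print(counts)
```

The three spellings of the first word are merged into one count, which
is what we want. But the words for "bridge" and "chopsticks" are also
merged into a single entry, which is the drawback.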

Any other ideas?

And a related question: does anyone have an extensive list of Japanese
transitive / intransitive verb pairs?

-----------------------
Brett Reynolds
Ontario, Canada
brett at forsyths.ca