[Corpora-List] Identifying words in Japanese
Brett Reynolds
brett at forsyths.ca
Tue Jun 17 15:18:01 UTC 2003
In Japanese, words are often written in a mixture of two scripts: kanji
(logographs) and hiragana (a syllabary). For example, where upper-case
letters indicate kanji, lower-case letters represent hiragana, and a
space marks a character boundary, you might find the following word:
HI k KO shi
Unfortunately, anything that's written in kanji can alternatively be
written using hiragana.
hi k ko shi
Further complicating the problem, sometimes hiragana occurring after a
kanji (okurigana) are omitted or assumed.
HIK KOSHI
HI k KOSHI
HIK KO shi
Thus, a word like this can be written five different ways. Given all
this, how would one go about doing a word-frequency count in Japanese?
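To make the problem concrete, here is a minimal sketch of what a naive
count over written forms does with this word. It assumes the text has
already been segmented into words, and it assumes the example above is
hikkoshi, so the five strings are my rendering of the five spellings:

```python
from collections import Counter

# The five spellings described above, in the same order.
# Assumption: the example word is hikkoshi; tokens are already segmented.
tokens = [
    "引っ越し",  # HI k KO shi
    "ひっこし",  # hi k ko shi
    "引越",      # HIK KOSHI
    "引っ越",    # HI k KOSHI
    "引越し",    # HIK KO shi
]

# A naive frequency count keyed on the written form.
surface_counts = Counter(tokens)

for form, n in surface_counts.items():
    print(form, n)
```

Each spelling gets its own entry with a count of 1, so one word with
five occurrences is reported as five words with one occurrence each.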
One option is to standardize everything to hiragana, which is doable.
The problem is that you then end up with a high percentage of
homographs: distinct words that share a pronunciation and would be
spelled differently (heterographic) were they written in kanji.
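A rough sketch of the hiragana option and its drawback follows. The
lookup table is hypothetical and hand-built for illustration (in
practice the readings would have to come from a dictionary or a
morphological analyzer), and the English glosses are only stand-ins for
word identity:

```python
from collections import Counter, defaultdict

# Hypothetical table: written form -> (word identity, hiragana reading).
TABLE = {
    "引っ越し": ("moving house", "ひっこし"),
    "ひっこし": ("moving house", "ひっこし"),
    "引越": ("moving house", "ひっこし"),
    "引越し": ("moving house", "ひっこし"),
    "橋": ("bridge", "はし"),
    "箸": ("chopsticks", "はし"),
    "端": ("edge", "はし"),
}

def count_by_reading(tokens, table=TABLE):
    """Count tokens by hiragana reading and report readings that
    end up covering more than one distinct word."""
    counts = Counter()
    words_per_reading = defaultdict(set)
    for token in tokens:
        # Forms missing from the table are kept as their own word and reading.
        word, reading = table.get(token, (token, token))
        counts[reading] += 1
        words_per_reading[reading].add(word)
    ambiguous = {r for r, words in words_per_reading.items() if len(words) > 1}
    return counts, ambiguous

tokens = ["引っ越し", "ひっこし", "引越", "引越し", "橋", "箸", "端"]
counts, ambiguous = count_by_reading(tokens)
```

Here the spelling variants of one word are merged as intended, but the
three different words read "hashi" are merged as well, and the count
alone can no longer tell them apart.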
Any other ideas?
And a related question: does anyone have an extensive list of Japanese
transitive / intransitive verb pairs?
-----------------------
Brett Reynolds
Ontario, Canada
brett at forsyths.ca