[Corpora-List] identifying words in Japanese
Christoph Neumann
neumann at nova.co.jp
Wed Jun 18 02:40:36 UTC 2003
Good morning from Tokyo,
To my mind, the variation in the writing system is not such a big
problem for word frequency counts. One has to combine several
approaches, though.
Most Japanese words have a canonical spelling: either hiragana or kanji
or, for "kun-yomi" verbs, adjectives and their derived nouns, a
canonical combination of kanji and okurigana. Japanese word processors
suggest this standard orthography automatically, which is a good way to
check it if you are not sure.
Variation in the mix of kanji and hiragana seems to occur only in a
restricted part of the vocabulary, namely compound verbs and their
derived nouns (as in the hikkoshi example). Even there, one spelling is
normally dominant (again, the word processor will tell you which). As
the general pattern is always Kanji-Hiragana-Kanji-Hiragana, one might
think of a dynamic solution that generates all the variants of a word
and maps them to a single entry.
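A rough sketch of what such a dynamic solution could look like, in
Python. The segment representation (kanji, reading of the kanji,
okurigana) and the function name are my own assumptions for
illustration, not an established tool. It produces the five spellings
Brett lists for hikkoshi: the all-hiragana form, plus every combination
of keeping or dropping the okurigana after each kanji.

```python
from itertools import product


def spelling_variants(segments):
    """Return the set of spellings for a Kanji-Hiragana-Kanji-Hiragana word.

    `segments` is a hypothetical representation: a list of
    (kanji, reading_of_kanji, okurigana) tuples, one per kanji.
    """
    variants = set()

    # Variant 1: everything written in hiragana.
    variants.add("".join(reading + okuri for _, reading, okuri in segments))

    # Remaining variants: every kanji is kept, and the okurigana after
    # each kanji is independently either written out or omitted.
    for keep in product([True, False], repeat=len(segments)):
        variants.add(
            "".join(
                kanji + (okuri if kept else "")
                for (kanji, _, okuri), kept in zip(segments, keep)
            )
        )
    return variants


# hikkoshi: HI + k, KO + shi
HIKKOSHI = [("引", "ひ", "っ"), ("越", "こ", "し")]
variants = spelling_variants(HIKKOSHI)
print(sorted(variants))
```

All variants found this way can then be counted under the dominant
spelling. Note that this sketch does not generate spellings that mix
kanji and hiragana per segment; those would need an extra option per
segment.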
Fortunately, only very frequent words seem to have real (unpredictable)
variation, like "watashi" ("I"). As those words are limited in number,
one can account for them beforehand by explicitly defining a variation
set for each of them, or simply add up their counts manually after
having a look at the 100 or so top-ranking words.
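The variation sets could be applied along these lines. This is only a
sketch: the table, the function name and the counts are invented for
illustration, and a real table would be filled in by hand after looking
at the top of the frequency list.

```python
from collections import Counter

# Hypothetical hand-made table: canonical spelling -> all accepted spellings.
VARIATION_SETS = {
    "私": {"私", "わたし"},  # watashi ("I")
}


def merge_counts(raw_counts, variation_sets):
    """Add up the counts of all spellings listed in one variation set."""
    # Invert the table: each spelling points to its canonical form.
    lookup = {
        spelling: canonical
        for canonical, spellings in variation_sets.items()
        for spelling in spellings
    }
    merged = Counter()
    for token, count in raw_counts.items():
        # Tokens without a variation set are counted under their own spelling.
        merged[lookup.get(token, token)] += count
    return merged


# Invented counts, for illustration only.
raw = Counter({"私": 120, "わたし": 45, "東京": 30})
merged = merge_counts(raw, VARIATION_SETS)
print(merged.most_common())
```

One caveat: a hiragana spelling listed in a set is merged
unconditionally, so the table should only contain spellings that are
unambiguous in the corpus at hand.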
Christoph Neumann
Brett Reynolds wrote:
> In Japanese, often words are written in a mixture of two scripts:
> kanji (logographs) and hiragana (syllabary). For example, where
> upper-case letters indicate kanji, lower-case represent hiragana, and
> a space indicates character boundaries, you might find the following
> word:
>
> HI k KO shi
>
> Unfortunately, anything that's written in kanji can alternatively be
> written using hiragana.
>
> hi k ko shi
>
> Further complicating the problem, sometimes hiragana occurring after a
> kanji (okurigana) are omitted or assumed.
>
> HIK KOSHI
> HI k KOSHI
> HIK KO shi
>
> Thus, a word like this can be written five different ways. Given all
> this, how would one go about doing a word-frequency count in Japanese?
> One option is to standardize everything to hiragana (doable). The
> problem with this is that you then end up with a high percentage of
> homographic heteronyms (they would be heterographic, were they written
> in kanji).
>
--
Dr. Christoph Neumann neumann at crosslanguage.co.jp
R&D MT, CrossLanguage KK
Tokyo, Japan
http://www.crosslanguage.co.jp/english/index.html