Corpora: Neologisms in Japanese

Masaaki NAGATA nagata at nttnly.isl.ntt.co.jp
Wed Apr 18 09:36:19 UTC 2001


> I have been trying out in vain to find statistical data or literature (in
> Japanese or other languages) on the following topic:
>
> What is the percentage of words written in katakana or kanji respectively
> among Japanese neologisms since 1945 and especially during the last decade
> (1991-2000)?

I once computed the relative frequency of each character type in the
EDR corpus, broken down by source. The results are attached at the
end of this mail.
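
For reference, here is a minimal sketch of how per-class character
frequencies of this kind could be computed. It is not the script used
for the figures in this mail. The Unicode ranges, the NFKC
normalisation step, counting by character rather than by word, and the
names char_class and class_frequencies are all assumptions made for
illustration.

```python
import unicodedata
from collections import Counter

# Column names follow the table at the end of this mail.
CLASSES = ("alpha", "hira", "kan", "kata", "num", "sym")


def char_class(ch):
    """Map one character to a coarse script class (assumed Unicode ranges)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:          # Hiragana block
        return "hira"
    if 0x30A1 <= code <= 0x30FA or 0x30FC <= code <= 0x30FF:
        return "kata"                     # Katakana letters and the long-vowel mark
    if 0x4E00 <= code <= 0x9FFF or ch == "々":
        return "kan"                      # CJK Unified Ideographs plus the repeat mark
    if ch.isascii() and ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "num"
    return "sym"                          # everything else, punctuation included


def class_frequencies(text):
    """Return the relative frequency of each class, ignoring whitespace."""
    # NFKC folds full-width letters/digits and half-width katakana
    # into their standard forms before classification.
    text = unicodedata.normalize("NFKC", text)
    counts = Counter(char_class(ch) for ch in text if not ch.isspace())
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in CLASSES}


if __name__ == "__main__":
    for name, share in class_frequencies("日本語のコーパス").items():
        print(f"{name:6s}{share:.3f}")
```

Running class_frequencies over the text of each source separately
would give one row per source, in the same layout as the attached
table.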

The EDR corpus is one of the largest annotated Japanese corpora.  Its
English description is available at http://www.iijnet.or.jp/edr/.

My gut feeling is that the proportion of kanji and katakana in
Japanese depends greatly on the topic of the text. If the topic is
relatively new or of Western origin, such as computer science,
there are a lot of katakana words.

The ratio of kanji and katakana to hiragana is also related to the
difficulty of the text. The more hiragana words are used, the
plainer the text is. So textbooks for children have more hiragana
words than newspapers.

> What is the largest corpus of Japanese words and proper names that can be
> accessed online?

If you are looking for a free corpus and a free dictionary, I think
the IPAL dictionary and the IPAL corpus are the largest ones. There
is a Japanese description at the following URL.

http://www.ipa.go.jp/STC/NIHONGO/IPAL/ipal.html

There is a comprehensive list of language resources at Prof.
Matsumoto's lab at Nara Advanced Institute of Science and Technology.
Unfortunately, it is written in Japanese.

http://cactus.aist-nara.ac.jp/lab/resource/resource.html

Web translation services such as the following might help you a little
bit.

http://sangenjaya.arc.net.my/index-e.html

-Masaaki

-----
Masaaki NAGATA, NTT Cyber Space Laboratories
1-1 Hikarinooka Yokosuka-Shi Kanagawa 239-0847 Japan
Email: nagata at nttnly.isl.ntt.co.jp Tel: +81-468-59-2796 Fax: +81-468-59-4758

----------------------------------------------------------------------
                         alpha  hira   kan    kata   num    sym
Aera (magazine)          0.003  0.463  0.354  0.080  0.025  0.076
Iwanami Info. Sci. Dict. 0.020  0.401  0.372  0.130  0.005  0.072
Magazines                0.042  0.387  0.255  0.196  0.030  0.090
Asahi newspaper          0.002  0.456  0.391  0.059  0.022  0.069
Nikkei newspaper         0.001  0.512  0.369  0.051  0.000  0.067
Heibonsha encyclopedia   0.003  0.443  0.403  0.080  0.008  0.062
Sentence Examples        0.000  0.603  0.293  0.027  0.000  0.077


Aera: 49589 sentences
Monthly magazine published by Asahi Shinbun (newspaper). Similar to Newsweek.

Iwanami Information Science Dictionary: 13578 sentences
Computer science dictionary published by Iwanami Shoten (publisher).

Magazines: 21199 sentences
Miscellaneous collection of magazines.

Asahi newspaper: 91400 sentences
One of the most popular national newspapers in Japan.

Nikkei newspaper: 5018 sentences
One of the most popular economic newspapers. Similar to the Wall Street Journal.

Heibonsha encyclopedia: 10072 sentences
One of the largest Japanese encyclopedias. Heibonsha is the name of
the publisher.

Sentence Examples: 16946 sentences
It seems these sentences are taken from dictionaries, but I don't know
where they come from.

----------------------------------------------------------------------
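
As a rough check of the topic and difficulty remarks in this mail, the
table can be sorted by a single column. The sketch below only
re-orders the figures copied from the table; the names TABLE, COLUMNS
and rank_by are hypothetical helpers introduced for illustration.

```python
# Figures copied from the table in this mail.
COLUMNS = ("alpha", "hira", "kan", "kata", "num", "sym")

TABLE = {
    "Aera (magazine)":          (0.003, 0.463, 0.354, 0.080, 0.025, 0.076),
    "Iwanami Info. Sci. Dict.": (0.020, 0.401, 0.372, 0.130, 0.005, 0.072),
    "Magazines":                (0.042, 0.387, 0.255, 0.196, 0.030, 0.090),
    "Asahi newspaper":          (0.002, 0.456, 0.391, 0.059, 0.022, 0.069),
    "Nikkei newspaper":         (0.001, 0.512, 0.369, 0.051, 0.000, 0.067),
    "Heibonsha encyclopedia":   (0.003, 0.443, 0.403, 0.080, 0.008, 0.062),
    "Sentence Examples":        (0.000, 0.603, 0.293, 0.027, 0.000, 0.077),
}


def rank_by(column):
    """Return source names ordered from highest to lowest share in `column`."""
    idx = COLUMNS.index(column)
    # sorted() is stable, so tied sources keep their table order.
    return sorted(TABLE, key=lambda name: TABLE[name][idx], reverse=True)


if __name__ == "__main__":
    for column in ("kata", "hira"):
        print(column)
        for name in rank_by(column):
            print(f"  {name:25s}{TABLE[name][COLUMNS.index(column)]:.3f}")
```

Sorted by kata, the magazine collection and the computer science
dictionary come first; sorted by hira, the Sentence Examples source
comes first.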


