[Corpora-List] Number of unique words in text for different languages

Kristina Hmeljak Sangawa kristina.hmeljak at guest.arnes.si
Sat Aug 14 14:31:09 UTC 2010


On 2010/08/10, at 9:11, Emmanuel Prochasson wrote:

> I am working on a trilingual comparable corpus of French/English and
> Japanese.
...
> What surprises me is that the number of unique nouns in Japanese is
> three times the number of unique nouns in English, even though the
> difference of total number of words in both languages is not that large
> (the ratio for French/English is more consistent for example).


Another possible reason for the difference could be the way "nouns" are categorized in the three languages.
If you used MeCab with the IPAdic dictionary, as recommended on MeCab's download site,
then the top-level POS category "名詞" (noun) includes not only general and proper nouns ("名詞-一般", "名詞-固有名詞"),
which correspond to nouns in English or French, but also subcategories such as
1) "名詞-サ変接続" (noun:verbal), nouns which can be used with the light verb "suru" to form a verb;
2) "名詞-形容動詞語幹" (noun:adjective-na), nouns which can also be used as adjectives by adding the suffix -na;
3) "名詞-代名詞" (noun:pronoun);
4) "名詞-副詞可能" (noun:adverbial), nouns which can also be used as adverbials;
5) "名詞-非自立" (noun:bound), nouns which can only be used in noun compounds, which are then split (or oversplit, as Jim Breen and others suggested) into more units than the corresponding English or French nouns would be.

I do not have exact numbers at hand, but the category noun:verbal is quite large and I think it could have influenced the difference in the total number of nouns, although that still does not account for the type/token ratio difference.
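
To get such numbers for your own corpus, something like the following rough Python sketch might help. It assumes the text has already been run through MeCab with IPAdic and saved in MeCab's default output format (surface form, a tab, then comma-separated features, with "EOS" closing each sentence). The function name noun_counts and the sample lines are my own illustration, written by hand rather than taken from a real run, so please check them against your actual output.

```python
from collections import defaultdict

# Hand-written sample in MeCab's default IPAdic format:
# surface<TAB>POS,subcategory1,subcategory2,subcategory3,conj.type,conj.form,lemma,reading,pronunciation
SAMPLE = """\
私\t名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は\t助詞,係助詞,*,*,*,*,は,ハ,ワ
言語\t名詞,一般,*,*,*,*,言語,ゲンゴ,ゲンゴ
を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
研究\t名詞,サ変接続,*,*,*,*,研究,ケンキュウ,ケンキュー
する\t動詞,自立,*,*,サ変・スル,基本形,する,スル,スル
EOS
研究\t名詞,サ変接続,*,*,*,*,研究,ケンキュウ,ケンキュー
は\t助詞,係助詞,*,*,*,*,は,ハ,ワ
簡単\t名詞,形容動詞語幹,*,*,*,*,簡単,カンタン,カンタン
だ\t助動詞,*,*,*,特殊・ダ,基本形,だ,ダ,ダ
EOS
"""


def noun_counts(mecab_output):
    """Return {noun subcategory: (token count, type count)} for 名詞 tokens."""
    tokens = defaultdict(int)   # subcategory -> number of noun tokens
    types = defaultdict(set)    # subcategory -> set of distinct surface forms
    for line in mecab_output.splitlines():
        # Skip sentence delimiters, blank lines, and anything without a tab.
        if not line or line == "EOS" or "\t" not in line:
            continue
        surface, features = line.split("\t", 1)
        fields = features.split(",")
        if fields[0] != "名詞":
            continue
        sub = fields[1] if len(fields) > 1 else "*"
        tokens[sub] += 1
        types[sub].add(surface)
    return {sub: (tokens[sub], len(types[sub])) for sub in tokens}


if __name__ == "__main__":
    for sub, (n_tokens, n_types) in noun_counts(SAMPLE).items():
        print(f"名詞-{sub}: {n_tokens} tokens, {n_types} types")
```

Comparing the counts with and without subcategories such as サ変接続 should show how much of the Japanese noun inventory has no direct counterpart among English or French nouns.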


Kristina Hmeljak Sangawa
kristina.hmeljak at guest.arnes.si
Dept. of Asian and African Studies, Faculty of Arts, University of Ljubljana


