[Corpora-List] Number of unique words in text for different languages

Tue Aug 10 07:11:43 UTC 2010

Dear all,

I am working on a trilingual comparable corpus of French, English and
Japanese. I am running a simple word count on each part of the corpus
and found surprising results for Japanese.

For each part, I count the total number of words and the number of
/unique words/, that is, I count every word only once, whether it
appears 1, 5 or 100 times. I POS-tagged each part of the corpus and
keep only the lemmatized form of every word (to group the different
inflections of one word). Furthermore, I focus only on nouns, keeping
the "名詞:一般" tag for Japanese (noun:general) and all nouns (including
proper nouns) in French/English. I use MeCab for Japanese and
TreeTagger for French/English.
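
To make the counting procedure concrete, here is a minimal sketch in
Python. It assumes the tagger output has already been read into
(lemma, tag) pairs; the function name count_nouns, the toy data and the
tag names "NN"/"NP" are only illustrative and are not the actual MeCab
or TreeTagger output format.

```python
from collections import Counter

def count_nouns(tagged_tokens, noun_tags):
    """Return (unique_lemmas, total_tokens) for tokens tagged as nouns.

    tagged_tokens: iterable of (lemma, tag) pairs (assumed input format).
    noun_tags: set of tags to keep, e.g. {"名詞:一般"} for Japanese.
    """
    counts = Counter(lemma for lemma, tag in tagged_tokens if tag in noun_tags)
    # len(counts) counts each lemma once; sum(...) counts every occurrence.
    return len(counts), sum(counts.values())

# Toy input with hypothetical tags, not real tagger output.
sample = [("cat", "NN"), ("cat", "NN"), ("run", "VB"),
          ("dog", "NN"), ("Paris", "NP")]
unique, total = count_nouns(sample, {"NN", "NP"})
print(unique, total)  # 3 unique noun lemmas, 4 noun tokens
```

The same function would be called once per language, with the set of
noun tags changed accordingly.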

Here are the results (unique words / total words):
Japanese: 189,798 / 5,174,800
English:   66,821 / 4,589,465
French:    23,970 / 1,796,183

What surprises me is that the number of unique nouns in Japanese is
nearly three times the number of unique nouns in English, even though
the difference in the total number of words between the two languages
is not that large (the French/English ratio is more consistent, for
example).
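
For reference, the unique/total ratios behind this comparison can be
worked out from the figures given in this message. The sketch below is
only arithmetic on those numbers; the names counts and ratios are
illustrative. Note that such ratios also depend on corpus size, so the
comparison across parts of different sizes is only a rough one.

```python
# (unique words, total words) per language, as reported in this message.
counts = {
    "Japanese": (189_798, 5_174_800),
    "English": (66_821, 4_589_465),
    "French": (23_970, 1_796_183),
}

# Unique/total ratio for each part of the corpus.
ratios = {lang: unique / total for lang, (unique, total) in counts.items()}

for lang, ratio in ratios.items():
    print(f"{lang}: {ratio:.2%}")
# Japanese: 3.67%
# English: 1.46%
# French: 1.33%

# How many times more unique nouns Japanese has than English.
ja_en = counts["Japanese"][0] / counts["English"][0]
print(f"Japanese/English unique nouns: {ja_en:.2f}")  # 2.84
```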

As far as I can tell, the tokenization/POS-tagging looks /ok/ (i.e., I
checked using Google Translate and it seems to make sense, but my lack
of skill in Japanese prevents me from investigating further).

Is this a normal result?

Regards,

-- 
Emmanuel Prochasson

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora