[Corpora-List] Number of unique words in text for different languages

Jim Breen jimbreen at gmail.com
Thu Aug 12 13:17:04 UTC 2010


On 12 August 2010 18:45, Emmanuel Prochasson wrote:
> I am working on a trilingual comparable corpus of French/English and
> Japanese. I am running a simple word count on each part of the corpus
> but found surprising results for Japanese.
>
> For each part, I count the total number of words and the number of
> /unique words/, that is, I count every word only once, whether it
> appears 1, 5 or 100 times. I POS-tagged each part of the corpus and
> kept only the lemmatized form of every word (to group the different
> inflections of a word). Furthermore, I focus only on nouns, keeping the
> "名詞:一般" tag for Japanese (noun:general) and all nouns (including proper
> nouns) in French/English. I use MeCab for Japanese and TreeTagger for
> French/English.
>
> Here are the results (Unique words/Total words).
> Japanese : 189,798 / 5,174,800
> English : 66,821 / 4,589,465
> French : 23,970 / 1,796,183
>
> What surprises me is that the number of unique nouns in Japanese is
> three times the number of unique nouns in English, even though the
> difference in the total number of words between the two languages is
> not that large (the French/English ratio is more consistent, for example).
>
> As far as I can tell, the tokenization/POS-tagging looks /ok/ (i.e., I
> checked using Google Translate and it seems to make sense, but my lack of
> skill in Japanese prevents me from investigating further).

Japanese morphological analysers such as MeCab and ChaSen tend to
over-split, so that what might be considered a single word in English or
French may end up as two or three elements in MeCab's output. For
example, "industrialization" is "kougyouka" in Japanese. MeCab (depending
on which lexicon you are using) will typically break it into "kougyou" and "ka",
i.e. "industry" and "ization". Both are tagged as nouns: noun-general
and noun-suffix respectively.
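
To make the effect of that split concrete, here is a minimal sketch in Python of counting unique nouns from MeCab-style output. The sample lines are hand-written to illustrate an IPADIC-like layout (surface form, a tab, then comma-separated features with the base form in the seventh field); that layout, the sample itself and the helper names are assumptions for illustration, so check them against what your own lexicon actually emits.

```python
from collections import Counter

# Hand-written sample in an assumed IPADIC-like layout:
# surface<TAB>pos,sub1,sub2,sub3,conj_type,conj_form,base,reading,pronunciation
# The first two lines are the Japanese word for "industrialization",
# split into a general noun and a noun suffix.
SAMPLE = (
    "工業\t名詞,一般,*,*,*,*,工業,コウギョウ,コーギョー\n"
    "化\t名詞,接尾,サ変接続,*,*,*,化,カ,カ\n"
    "が\t助詞,格助詞,一般,*,*,*,が,ガ,ガ\n"
    "EOS\n"
)


def parse_tokens(output):
    """Yield (surface, feature_list) for each token line, skipping EOS markers."""
    for line in output.splitlines():
        if not line.strip() or line == "EOS":
            continue
        surface, _, features = line.partition("\t")
        yield surface, features.split(",")


def noun_counts(output, subtype=None):
    """Count noun lemmas; optionally keep only one noun subtype (e.g. '一般')."""
    counts = Counter()
    for surface, feats in parse_tokens(output):
        if feats[0] != "名詞":
            continue
        if subtype is not None and (len(feats) < 2 or feats[1] != subtype):
            continue
        # Use the base-form field when present, otherwise the surface form.
        base = feats[6] if len(feats) > 6 and feats[6] != "*" else surface
        counts[base] += 1
    return counts


all_nouns = noun_counts(SAMPLE)            # general noun and suffix both counted
general_only = noun_counts(SAMPLE, "一般")  # suffix excluded by the subtype filter
print(len(all_nouns), len(general_only))
```

Counting all nouns treats the suffix as a type of its own, whereas filtering on noun-general drops it, so it is worth checking which of the two filters a given count was produced with.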

That said, I would not expect a factor of three difference.

As a test, I put the Japanese and English components of the Tanaka
Corpus (approx. 150,000 sentence pairs) through MeCab and TreeTagger.
The unique noun counts (all meishi in Japanese and NN in English) were
13,725 and 12,106 respectively. That is more what I would expect.
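
For a rough comparison of the figures in this thread, the arithmetic can be sketched as below. The numbers are the ones quoted above; the function and variable names are just illustrative. Bear in mind that unique/total ratios shrink as a corpus grows, so ratios from corpora of different sizes are only loosely comparable.

```python
# (unique nouns, total words) as reported earlier in this thread
COUNTS = {
    "Japanese": (189_798, 5_174_800),
    "English": (66_821, 4_589_465),
    "French": (23_970, 1_796_183),
}

# Unique noun counts from the Tanaka Corpus test
TANAKA_UNIQUE = {"Japanese": 13_725, "English": 12_106}


def type_token_ratio(unique, total):
    """Unique items divided by total items."""
    return unique / total


def unique_ratio(a, b):
    """How many times larger one unique count is than another."""
    return a / b


for language, (unique, total) in COUNTS.items():
    print(f"{language}: {type_token_ratio(unique, total):.4f}")

# Japanese/English ratio of unique nouns in the reported corpus
reported = unique_ratio(COUNTS["Japanese"][0], COUNTS["English"][0])
# The same ratio for the Tanaka Corpus test
tanaka = unique_ratio(TANAKA_UNIQUE["Japanese"], TANAKA_UNIQUE["English"])
print(f"reported: {reported:.2f}, Tanaka: {tanaka:.2f}")
```

On these figures the Japanese/English ratio of unique nouns is a little under three in the reported corpus, against a little over one for the Tanaka sentences.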

Your number of unique words in Japanese seems extraordinarily large.
For comparison, the MeCab output from the Tanaka sentences contains only
about 19,000 unique tokens.

If you contact me offline, I may be able to help with analysing the output from
MeCab.

Jim

-- 
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Treasurer: Hawthorn Rowing Club, Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


