[Corpora-List] Number of unique words in text for different languages

fatima zuhra fateeshah at yahoo.com
Sat Aug 14 08:13:39 UTC 2010


Perhaps page 5 of the paper, available from the following URL, contains useful information in this regard:
http://gandalf.aksis.uib.no/non/lrec2000/pdf/262.pdf
 
Regards.
 
Fatima Tuz Zuhra
Department of Computer Science,
University of Peshawar. Peshawar. Pakistan.
--- On Tue, 8/10/10, Emmanuel Prochasson <eprochasson at gmail.com> wrote:


From: Emmanuel Prochasson <eprochasson at gmail.com>
Subject: [Corpora-List] Number of unique words in text for different languages
To: corpora at uib.no
Date: Tuesday, August 10, 2010, 12:11 PM


Dear all,

I am working on a trilingual comparable corpus of French/English and
Japanese. I am running a simple word count on each part of the corpus
but found surprising results for Japanese.

For each part, I count the total number of words and the number of
/unique words/, that is, I count every word only once, whether it
appears 1, 5 or 100 times. I POS-tagged each part of the corpus and
keep only the lemmatized form of each word (to group the different
inflected forms of a word). Furthermore, I focus only on nouns, keeping
the "名詞:一般" tag for Japanese (noun:general) and all nouns (including
proper nouns) in French/English. I use MeCab for Japanese and TreeTagger
for French/English.
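
The counting described above can be sketched roughly as follows. This is
only an illustration, not the actual script used: it assumes the tagger
output has already been parsed into (lemma, tag) pairs, and the function
name noun_counts, the sample data and the tag names "NN"/"NP" are
hypothetical placeholders rather than real MeCab or TreeTagger output.

```python
from collections import Counter

def noun_counts(tagged_tokens, noun_tags):
    """Return (unique, total) for lemmas whose POS tag is in noun_tags.

    tagged_tokens: iterable of (lemma, tag) pairs, assumed to be already
    extracted from the tagger output (hypothetical input format).
    """
    counts = Counter(lemma for lemma, tag in tagged_tokens if tag in noun_tags)
    # Each distinct lemma is one "unique word"; the sum of the
    # frequencies is the total number of noun occurrences.
    return len(counts), sum(counts.values())

# Toy example with made-up tags, for illustration only.
sample = [("cat", "NN"), ("sit", "VB"), ("cat", "NN"), ("mat", "NN"), ("Paris", "NP")]
unique, total = noun_counts(sample, {"NN", "NP"})
print(unique, total)  # 3 4
```

For Japanese, the set of noun tags would contain only the general-noun
tag, while for French/English it would contain every noun tag of the
tagset, proper nouns included.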

Here are the results (Unique words/Total words).
Japanese : 189,798 / 5,174,800
English : 66,821 / 4,589,465
French : 23,970 / 1,796,183

What surprises me is that the number of unique nouns in Japanese is
almost three times the number of unique nouns in English, even though
the difference in the total number of words between the two languages
is not that large (the ratio is more consistent between French and
English, for example).
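
To make the comparison concrete, the unique/total ratio can be computed
directly from the figures reported above. This is a small sketch; the
variable names are arbitrary, and only the six counts come from the
message itself.

```python
# (unique nouns, total nouns) as reported in the results above.
results = {
    "Japanese": (189_798, 5_174_800),
    "English": (66_821, 4_589_465),
    "French": (23_970, 1_796_183),
}

# Ratio of unique nouns to total nouns for each language.
ratios = {lang: unique / total for lang, (unique, total) in results.items()}

for lang, ratio in ratios.items():
    print(f"{lang}: {ratio * 100:.2f}%")
# Japanese: 3.67%
# English: 1.46%
# French: 1.33%
```

Note that this ratio generally decreases as a corpus grows, so comparing
parts of different sizes (the French part is much smaller) is only
indicative.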

As far as I can tell, the tokenization/POS-tagging looks /ok/ (i.e., I
checked using Google Translate and it seems to make sense, but my lack
of skill in Japanese prevents me from investigating further).

Is this a normal result?

Regards,

-- 
Emmanuel Prochasson

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora



      