<table cellspacing="0" cellpadding="0" border="0" ><tr><td valign="top" style="font: inherit;"><DIV><BR>Perhaps page 5 of the paper, available from the following URL, contains useful information in this regard:</DIV>
<DIV><A href="http://gandalf.aksis.uib.no/non/lrec2000/pdf/262.pdf">http://gandalf.aksis.uib.no/non/lrec2000/pdf/262.pdf</A></DIV>
<DIV> </DIV>
<DIV>Regards.</DIV>
<DIV> </DIV>
<DIV>Fatima Tuz Zuhra</DIV>
<DIV>Department of Computer Science,</DIV>
<DIV>University of Peshawar. Peshawar. Pakistan.<BR>--- On <B>Tue, 8/10/10, Emmanuel Prochasson <I>&lt;eprochasson@gmail.com&gt;</I></B> wrote:<BR></DIV>
<BLOCKQUOTE style="PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: rgb(16,16,255) 2px solid"><BR>From: Emmanuel Prochasson &lt;eprochasson@gmail.com&gt;<BR>Subject: [Corpora-List] Number of unique words in text for different languages<BR>To: corpora@uib.no<BR>Date: Tuesday, August 10, 2010, 12:11 PM<BR><BR>
<DIV class=plainMail>Dear all,<BR><BR>I am working on a trilingual comparable corpus of French, English and<BR>Japanese. I am running a simple word count on each part of the corpus<BR>but found surprising results for Japanese.<BR><BR>For each part, I count the total number of words and the number of<BR>/unique words/, that is, I count every word only once, whether it<BR>appears 1, 5 or 100 times. I POS-tagged each part of the corpus and<BR>kept only the lemmatized form of every word (to group the different<BR>inflections of a word). Furthermore, I focus only on nouns, keeping the<BR>"名詞:一般" tag for Japanese (noun:general) and all nouns (including proper<BR>nouns) in French/English. I use MeCab for Japanese and TreeTagger for<BR>French/English.<BR><BR>Here are the results (Unique words/Total words).<BR>Japanese: 189,798 / 5,174,800<BR>English: 66,821 / 4,589,465<BR>French: 23,970 / 1,796,183<BR><BR>What surprises me is that the number of unique
nouns in Japanese is<BR>about three times the number of unique nouns in English, even though the<BR>difference in the total number of words between the two languages is not that large<BR>(the French/English ratio is more consistent, for example).<BR><BR>As far as I can tell, the tokenization/POS-tagging looks /ok/ (i.e., I<BR>checked using Google Translate and it seems to make sense, but my lack of<BR>skill in Japanese prevents me from investigating deeper).<BR><BR>Is this a normal result?<BR><BR>Regards,<BR><BR>-- <BR>Emmanuel Prochasson<BR><BR>_______________________________________________<BR>Corpora mailing list<BR><A href="mailto:Corpora@uib.no">Corpora@uib.no</A><BR><A href="http://mailman.uib.no/listinfo/corpora" target=_blank>http://mailman.uib.no/listinfo/corpora</A><BR></DIV></BLOCKQUOTE></td></tr></table><br>
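<DIV>As a rough check on the figures quoted in the message, the unique/total ratio (often called the type/token ratio) can be computed for each language. This is only a minimal sketch: the counts are copied from the quoted email, while the names <code>counts</code> and <code>type_token_ratio</code> are illustrative assumptions and not part of the original poster's pipeline.</DIV>

```python
# Unique-noun and total-word counts as reported in the quoted message.
counts = {
    "Japanese": (189_798, 5_174_800),
    "English": (66_821, 4_589_465),
    "French": (23_970, 1_796_183),
}


def type_token_ratio(unique: int, total: int) -> float:
    """Return the share of unique items (types) among all items (tokens)."""
    return unique / total


# Hypothetical helper mapping: language -> type/token ratio.
ratios = {lang: type_token_ratio(u, t) for lang, (u, t) in counts.items()}

for lang, ratio in ratios.items():
    print(f"{lang}: {ratio:.4f}")

# How much larger the Japanese ratio is than the English one.
print(f"Japanese/English: {ratios['Japanese'] / ratios['English']:.2f}")
```

<DIV>Note that this ratio generally falls as a corpus grows, so comparing it across parts of different sizes (the French part is much smaller) is only indicative.</DIV>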