[Corpora-List] Number of unique words in text for different languages

Sat Aug 14 08:55:43 UTC 2010

Is there an 'apples and oranges' dimension to this question when it
involves comparing relatively isolating languages to more synthetic
ones? Is picking the morphological analyzer or its settings going to
bring us closer to an 'apples and apples' comparison when, to begin
with, morphologically speaking, we've got two different fruits in hand
with the two languages being compared?

David Wible
Dean, College of Humanities
National Central University
Jhongli, Taiwan

On Saturday, August 14, 2010, fatima zuhra <fateeshah at yahoo.com> wrote:
>
> Perhaps page 5 of the paper, available from the following URL, contains useful information in this regard:
> http://gandalf.aksis.uib.no/non/lrec2000/pdf/262.pdf
>
> Regards.
>
> Fatima Tuz Zuhra
> Department of Computer Science,
> University of Peshawar. Peshawar. Pakistan.
> --- On Tue, 8/10/10, Emmanuel Prochasson <eprochasson at gmail.com> wrote:
>
>
> From: Emmanuel Prochasson <eprochasson at gmail.com>
> Subject: [Corpora-List] Number of unique words in text for different languages
> To: corpora at uib.no
> Date: Tuesday, August 10, 2010, 12:11 PM
>
> Dear all,
>
> I am working on a trilingual comparable corpus of French/English and
> Japanese. I am running a simple word count on each part of the corpus
> but found surprising results for Japanese.
>
> For each part, I count the total number of words and the number of
> /unique words/, that is, I count every word only once, whether it
> appears 1, 5 or 100 times. I POS-tagged each part of the corpus and
> kept only the lemmatized form of every word (to group the different
> inflected forms of one word). Furthermore, I focus only on nouns, keeping the
> "名詞:一般" tag for Japanese (noun:general) and all nouns (including proper
> nouns) in French/English. I use MeCab for Japanese and TreeTagger for
> French/English.
>
> Here are the results (Unique words/Total words).
> Japanese : 189,798 / 5,174,800
> English : 66,821 / 4,589,465
> French : 23,970 / 1,796,183
>
> What surprises me is that the number of unique nouns in Japanese is
> about three times the number of unique nouns in English, even though the
> difference in the total number of words between the two languages is not that large
> (the ratio for French/English is more consistent, for example).
>
> As far as I can tell, the tokenization/POS-tagging looks /ok/ (i.e., I
> checked using Google Translate and it seems to make sense, but my lack of
> skill in Japanese prevents me from investigating further).
>
> Is this a normal result?
>
> Regards,
>
> --
> Emmanuel Prochasson
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>
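
For anyone wanting to reproduce the comparison, the counting procedure described in the quoted message might be sketched roughly as below. This is only an illustration under stated assumptions: the input is assumed to be a list of (lemma, tag) pairs already produced by a tagger, the function names and the toy sample are hypothetical, and the tag names in the sample are merely illustrative rather than the exact tag sets of MeCab or TreeTagger. The figures in `reported` are the ones given in the quoted message.

```python
from collections import Counter

def count_types_tokens(tagged, noun_tags):
    """Return (unique lemmas, total occurrences) for tokens whose tag is in noun_tags.

    `tagged` is assumed to be an iterable of (lemma, tag) pairs (hypothetical format).
    """
    lemmas = Counter(lemma for lemma, tag in tagged if tag in noun_tags)
    return len(lemmas), sum(lemmas.values())

def type_token_ratio(types, tokens):
    """Unique words divided by total words; 0.0 for an empty corpus."""
    return types / tokens if tokens else 0.0

# Toy sample with illustrative tags (not real tagger output).
sample = [("cat", "NN"), ("sit", "VV"), ("cat", "NN"), ("Paris", "NP"), ("mat", "NN")]
sample_types, sample_tokens = count_types_tokens(sample, {"NN", "NP"})

# (unique words, total words) as reported in the quoted message.
reported = {
    "Japanese": (189798, 5174800),
    "English": (66821, 4589465),
    "French": (23970, 1796183),
}

for language, (types, tokens) in reported.items():
    print(f"{language}: {type_token_ratio(types, tokens):.5f}")

# Japanese vs. English: ratio of unique words and ratio of total words.
unique_ratio = reported["Japanese"][0] / reported["English"][0]
total_ratio = reported["Japanese"][1] / reported["English"][1]
print(f"unique-word ratio JA/EN: {unique_ratio:.2f}, total-word ratio JA/EN: {total_ratio:.2f}")
```

Putting the reported numbers on a common unique/total scale makes the gap easier to see, but it does not settle whether a MeCab unit and a TreeTagger lemma are comparable units in the first place.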
