[Corpora-List] Number of unique words in text for different languages

Sun Aug 15 01:20:30 UTC 2010

Or is morphology merely in the preparation of the fruit, the preferred slicing?

Justin Washtell
University of Leeds

________________________________________
From: corpora-bounces at uib.no [corpora-bounces at uib.no] On Behalf Of David Wible [wible at stringnet.org]
Sent: 14 August 2010 09:55
To: fatima zuhra
Cc: corpora at uib.no; Emmanuel Prochasson
Subject: Re: [Corpora-List] Number of unique words in text for different languages

Is there an 'apples and oranges' dimension to this question when it
involves comparing relatively isolating languages to more synthetic
ones? Is picking the morphological analyzer or its settings going to
bring us closer to an 'apples and apples' comparison when, to begin
with, morphologically speaking we've got two different fruits in hand
with the two languages being compared?

David Wible
Dean, College of Humanities
National Central University
Jhongli. Taiwan

On Saturday, August 14, 2010, fatima zuhra <fateeshah at yahoo.com> wrote:
>
> Perhaps page 5 of the paper, available from the following URL, contains useful information in this regard:
> http://gandalf.aksis.uib.no/non/lrec2000/pdf/262.pdf
>
> Regards.
>
> Fatima Tuz Zuhra
> Department of Computer Science,
> University of Peshawar. Peshawar. Pakistan.
> --- On Tue, 8/10/10, Emmanuel Prochasson <eprochasson at gmail.com> wrote:
>
>
> From: Emmanuel Prochasson <eprochasson at gmail.com>
> Subject: [Corpora-List] Number of unique words in text for different languages
> To: corpora at uib.no
> Date: Tuesday, August 10, 2010, 12:11 PM
>
> Dear all,
>
> I am working on a trilingual comparable corpus of French/English and
> Japanese. I am running a simple word count on each part of the corpus
> but found surprising results for Japanese.
>
> For each part, I count the total number of words and the number of
> /unique words/, that is, I count every word only once, whether it
> appears 1, 5 or 100 times. I POS-tagged each part of the corpus and
> kept only the lemmatized form of every word (to group the different
> inflections of one word). Furthermore, I focus only on nouns, keeping the
> "名詞:一般" tag for Japanese (noun:general) and all nouns (including proper
> nouns) in French/English. I use MeCab for Japanese and TreeTagger for
> French/English.
>
> Here are the results (Unique words/Total words).
> Japanese : 189,798 / 5,174,800
> English : 66,821 / 4,589,465
> French : 23,970 / 1,796,183
>
> What surprises me is that the number of unique nouns in Japanese is
> almost three times the number of unique nouns in English, even though
> the difference in the total number of words between the two languages
> is not that large (the ratio for French/English is more consistent,
> for example).
>
> As far as I can tell, the tokenization/POS-tagging looks /ok/ (i.e., I
> checked using Google Translate and it seems to make sense, but my lack
> of skill in Japanese prevents me from investigating further).
>
> Is this a normal result?
>
> Regards,
>
> --
> Emmanuel Prochasson
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>
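
For readers who want to redo the arithmetic in the quoted message, here is a minimal Python sketch of the counting it describes: unique lemmas (types) against total occurrences (tokens), restricted to a chosen set of tags. The function name `type_token_counts`, the `(lemma, tag)` input format and the toy sample are assumptions made for illustration; they are not the output format of MeCab or TreeTagger. The three pairs of figures are the ones reported in the message.

```python
from collections import Counter

def type_token_counts(tagged, keep_tags):
    """Return (unique lemmas, total occurrences) for the tags in keep_tags.

    `tagged` is an iterable of (lemma, tag) pairs. This layout is a
    hypothetical stand-in for whatever the tagger actually emits.
    """
    counts = Counter(lemma for lemma, tag in tagged if tag in keep_tags)
    return len(counts), sum(counts.values())

# Toy input, invented for illustration: three noun tokens, two noun types.
sample = [("dog", "NN"), ("dog", "NN"), ("run", "VB"), ("cat", "NN")]
types, tokens = type_token_counts(sample, {"NN"})  # (2, 3)

# Figures reported in the quoted message: (unique nouns, total nouns).
reported = {
    "Japanese": (189_798, 5_174_800),
    "English": (66_821, 4_589_465),
    "French": (23_970, 1_796_183),
}

# Type/token ratio per language.
ratios = {lang: u / t for lang, (u, t) in reported.items()}

# How many times more unique nouns Japanese has than English.
ja_en_types = reported["Japanese"][0] / reported["English"][0]

for lang, r in ratios.items():
    print(f"{lang}: {r:.4f}")
print(f"Japanese/English unique nouns: {ja_en_types:.2f}")
```

On the reported figures this gives a type/token ratio of roughly 0.037 for Japanese against roughly 0.015 for English and 0.013 for French, and about 2.8 times as many unique nouns in Japanese as in English. Type/token ratios generally fall as a corpus grows, so the smaller French part is not strictly comparable with the other two.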

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora