[Corpora-List] Fwd: Number of unique words in text for different languages

Thu Aug 12 15:11:35 UTC 2010

Hi all,

As a disclaimer, I have not worked with any of the tokenizers. For the type
of results originally reported, however, I do have a suggestion for a
possible partial explanation, based on some experience with Spanish. There
is a real stylistic rule in Spanish which makes speakers, and especially
writers, avoid repeating the same 'content word' within the same or
contiguous sentences or clauses, using a synonym or paraphrase instead
(incidentally, the use of paraphrase may partly explain the well-known
fact that Spanish sentences are longer than English ones on average). We
might expect, then, to find a larger number of unique words in Spanish than
in English, even in smaller comparable corpora. *If* Japanese had similar
stylistic tendencies, that could account for part of such results.

It would be instructive to examine Spanish vs. English in this regard, which
would show that pragmatic considerations (word choice restrictions) may
influence distributions as much as 'grammatical' ones (like the German
'monster nouns' vs. the English compound nouns that John mentioned).
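
A minimal sketch of how such a Spanish vs. English comparison might be set
up: count the distinct words seen after the same number of running words in
each language, so that corpus size does not confound the result. The function
name unique_counts_at and the two toy sentences are invented for
illustration, and the input is assumed to be already tokenized and
lemmatized.

```python
from typing import Dict, Iterable, List


def unique_counts_at(tokens: List[str], checkpoints: Iterable[int]) -> Dict[int, int]:
    """Map each checkpoint n to the number of distinct tokens among the first n."""
    # Ignore checkpoints that fall outside the token list.
    wanted = sorted({n for n in checkpoints if 0 < n <= len(tokens)})
    seen = set()
    result: Dict[int, int] = {}
    next_idx = 0
    for position, token in enumerate(tokens, start=1):
        seen.add(token)
        if next_idx < len(wanted) and position == wanted[next_idx]:
            result[position] = len(seen)
            next_idx += 1
    return result


# Invented toy data (not from any corpus): the Spanish line swaps in a
# synonym ("automóvil") where the English line repeats "car".
spanish = "el coche es rojo y el automóvil corre mucho".split()
english = "the car is red and the car runs fast".split()

for name, toks in (("Spanish", spanish), ("English", english)):
    print(name, unique_counts_at(toks, [5, 9]))
```

On real data the checkpoints would be far larger, and a steeper curve for
Spanish at equal token counts would be consistent with the synonym-avoidance
explanation without proving it.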

Jim

On Thu, Aug 12, 2010 at 8:17 AM, Jim Breen <jimbreen at gmail.com> wrote:

> On 12 August 2010 18:45, Emmanuel Prochasson wrote:
> > I am working on a trilingual comparable corpus of French/English and
> > Japanese. I am running a simple word count on each part of the corpus
> > but found surprising results for Japanese.
> >
> > For each part, I count the total number of words and the number of
> > /unique words/, that is, I count every word only once, whether it
> > appears 1, 5 or 100 times. I POS-tagged each part of the corpus and
> > keep only the lemmatized version of every word (to group the different
> > inflections of one word). Furthermore, I focus only on nouns, keeping the
> > "名詞:一般" tag for Japanese (noun:general) and all nouns (including proper
> > nouns) in French/English. I use MeCab for Japanese and TreeTagger for
> > French/English.
> >
> > Here are the results (Unique words/Total words).
> > Japanese : 189,798 / 5,174,800
> > English : 66,821 / 4,589,465
> > French : 23,970 / 1,796,183
> >
> > What surprises me is that the number of unique nouns in Japanese is
> > three times the number of unique nouns in English, even though the
> > difference in the total number of words between the two languages is
> > not that large (the ratio for French/English is more consistent, for
> > example).
> >
> > As far as I can tell, the tokenization/POS-tagging looks /ok/ (i.e., I
> > checked using Google Translate and it seems to make sense, but my lack
> > of skill in Japanese prevents me from investigating deeper).
>
> Japanese morphological analysers such as MeCab, Chasen, etc. tend to
> over-split so that what might be considered a single word in English or
> French may end up as two or three elements in MeCab's output. For
> example, "industrialization" is "jigyouka" in Japanese. MeCab (depending
> on which lexicon you are using) will typically break it into "jigyou" and
> "ka", i.e. "industry" and "ization". Both are tagged as nouns:
> noun-general and noun-suffix.
>
> That said, I would not expect a factor of three difference.
>
> As a test, I put the Japanese and English components of the Tanaka
> Corpus (approx. 150,000 sentence pairs) through MeCab and TreeTagger.
> The unique noun counts (all meishi in Japanese and NN in English) were
> 13,725 and 12,106 respectively. That is more what I would expect.
>
> Your number of unique words in Japanese seems extraordinarily large.
> As a comparison, the MeCab output from the Tanaka sentences is only
> about 19,000 unique tokens.
>
> If you contact me offline, I may be able to help with analysing the
> output from MeCab.
>
> Jim
>
> --
> Jim Breen
> Adjunct Snr Research Fellow, Clayton School of IT, Monash University
> Treasurer: Hawthorn Rowing Club, Japanese Studies Centre
> Graduate student: Language Technology Group, University of Melbourne
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
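
The figures quoted above can be put on a common scale with a little
arithmetic. This is only a sketch: the dictionary counts and the helper
relative_to are names invented here, and the numbers are copied from the
quoted message. Note that a raw unique/total ratio is sensitive to corpus
size, so it should be read with caution.

```python
# (unique nouns, total words) as reported in the quoted message.
counts = {
    "Japanese": (189_798, 5_174_800),
    "English": (66_821, 4_589_465),
    "French": (23_970, 1_796_183),
}


def relative_to(base: str, lang: str) -> tuple:
    """Return (unique-count ratio, total-count ratio) of lang against base."""
    base_unique, base_total = counts[base]
    lang_unique, lang_total = counts[lang]
    return lang_unique / base_unique, lang_total / base_total


for lang, (unique, total) in counts.items():
    print(f"{lang}: unique/total = {unique / total:.4f}")

for lang in ("Japanese", "French"):
    unique_ratio, total_ratio = relative_to("English", lang)
    print(f"{lang} vs English: unique x{unique_ratio:.2f}, total x{total_ratio:.2f}")
```

If the two ratios for a language pair are close, the unique counts scale
roughly with corpus size; a large gap between them is what made the Japanese
result look surprising.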

-- 
James L. Fidelholtz
Posgrado en Ciencias del Lenguaje
Instituto de Ciencias Sociales y Humanidades
Benemérita Universidad Autónoma de Puebla, MÉXICO

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora