[Corpora-List] Number of unique words in text for different languages

Fri Aug 13 09:04:09 UTC 2010

Hi all,

I also think the behavior of compounds in Japanese might be the reason, as
these don't generally increase the type count in English but very much do
for Japanese (but like Jim says, that largely depends on the type of
splitting used). 

I think the comparison with German is probably quite informative. For
instance this paper:

- Evert, Stefan (2004), A simple LNRE model for random character sequences.
In: Proceedings of JADT 2004, 411-422.

gives the following type/token counts for English and German nouns:

English: 217,527 / 19,000,000
German: 1,556,203 / 48,000,000

So again, the type-token ratio for German nouns, compounds included, is
almost three times the English one (roughly 0.032 vs. 0.011). However, keep
in mind that vocabulary does not grow linearly with corpus size, since
finding new words becomes progressively more difficult the more types we
have seen, so ratios taken from corpora of different sizes are only roughly
comparable (take a look at Harald Baayen's work for details on the behavior
of vocabulary growth).
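
To make the arithmetic behind that comparison explicit, here is a minimal
Python sketch that computes the type-token ratios from the counts quoted
above. The function and variable names are illustrative only, and because
the two corpora differ in size, the resulting ratio should be read as a
rough indication rather than a precise measure.

```python
# Noun counts as quoted above: (types, tokens).
counts = {
    "English": (217_527, 19_000_000),
    "German": (1_556_203, 48_000_000),
}


def type_token_ratio(types: int, tokens: int) -> float:
    """Number of distinct word forms divided by number of running words."""
    return types / tokens


ttr = {lang: type_token_ratio(*c) for lang, c in counts.items()}
for lang, value in ttr.items():
    print(f"{lang}: {value:.4f}")

# How much higher the German ratio is than the English one.
ratio = ttr["German"] / ttr["English"]
print(f"German/English: {ratio:.2f}")
```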

I think the best way to assess the behavior of the analyzers might be to
take a parallel corpus, which should contain about the same number of
referents for English and Japanese, and see how many tokens are produced for
each language. That way you can estimate the bias introduced by the
splitting strategy.
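
As a rough sketch of that comparison, assuming the parallel corpus is
available as aligned sentence pairs and that each language has some
tokenizer function: the names corpus_stats and token_ratio are hypothetical,
and whitespace splitting on pre-segmented romanized text merely stands in
for a real analyzer's output. The single toy pair reuses the example from
Jim's message quoted below.

```python
from typing import Callable, List, Sequence, Tuple

Tokenizer = Callable[[str], List[str]]


def corpus_stats(sentences: Sequence[str], tokenize: Tokenizer) -> Tuple[int, int]:
    """Return (token count, type count) for one side of the corpus."""
    tokens = 0
    types = set()
    for sentence in sentences:
        toks = tokenize(sentence)
        tokens += len(toks)
        types.update(toks)
    return tokens, len(types)


def token_ratio(pairs: Sequence[Tuple[str, str]],
                tok_a: Tokenizer, tok_b: Tokenizer) -> float:
    """Tokens produced for side B per token produced for side A."""
    a_tokens, _ = corpus_stats([a for a, _ in pairs], tok_a)
    b_tokens, _ = corpus_stats([b for _, b in pairs], tok_b)
    return b_tokens / a_tokens


# Toy data: the Japanese side is given as already-segmented romanized text,
# standing in for whatever an analyzer would actually output.
pairs = [("antiasthmatic drug", "kikanshi zensoku chiryou yaku")]

print(token_ratio(pairs, str.split, str.split))
```

A ratio well above 1 on a large parallel corpus would suggest that the
Japanese analyzer splits more finely than the English tokenization, which
is the bias to correct for before comparing type counts.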

Best,
Amir Zeldes
------------------
Institut für deutsche Sprache und Linguistik
Humboldt-Universität zu Berlin
Unter den Linden 6
D-10099 Berlin

Tel: +49-(0)30-2093-9727

URL:
http://www.linguistik.hu-berlin.de/institut/professuren/korpuslinguistik/mitarbeiter-innen-en/amir/standardseite

> -----Original Message-----
> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
> Jim Breen
> Sent: Friday, August 13, 2010 1:21
> To: corpora at uib.no
> Subject: Re: [Corpora-List] Number of unique words in text for
> different languages
> 
> John F. Sowa wrote:
> > On 8/12/2010 9:17 AM, Jim Breen wrote:
> >> Japanese morphological analysers such as MeCab, Chasen, etc. tend to
> >> over-split so that what might be considered a single word in English or
> >> French may end up as two or three elements in MeCab's output.
> 
> > Over-splitting would increase the total word count, but reduce the
> > count of unique words. The huge number of unique words that
> > Emmanuel Prochasson found was probably the result of grouping
> > long Kanji strings into a single so-called noun.
> 
> In fact MeCab splits long kanji strings into the component words. For
> example "kikanshizensokuchiryouyaku" (antiasthmatic drug) is typically
> split:
> kikanshi + zensoku + chiryou + yaku. (I say "typically", because you have to
> use a trained lexicon with MeCab, and there are several to choose from.)
> 
> Perhaps Emmanuel combined sequences of noun-tagged morphemes, but even
> then I don't think it could have got such a high unique word count.
> 
> Jim
> 
> --
> Jim Breen
> Adjunct Snr Research Fellow, Clayton School of IT, Monash University
> Treasurer: Hawthorn Rowing Club, Japanese Studies Centre
> Graduate student: Language Technology Group, University of Melbourne
> 
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
