[Corpora-List] Number of unique words in text for different languages

Jim Breen jimbreen at gmail.com
Thu Aug 12 23:21:10 UTC 2010


John F. Sowa wrote:
> On 8/12/2010 9:17 AM, Jim Breen wrote:
>> Japanese morphological analysers such as MeCab, ChaSen, etc. tend to
>> over-split so that what might be considered a single word in English or
>> French may end up as two or three elements in MeCab's output.

> Over-splitting would increase the total word count, but reduce the
> count of unique words. The huge number of unique words that
> Emmanuel Prochasson found was probably the result of grouping
> long Kanji strings into a single so-called noun.

In fact MeCab splits long kanji strings into their component words. For
example "kikanshizensokuchiryouyaku" (antiasthmatic drug) is typically split:
kikanshi + zensoku + chiryou + yaku. (I say "typically" because MeCab has to
be used with a trained lexicon, there are several to choose from, and the
segmentation can differ between them.)

Perhaps Emmanuel combined sequences of noun-tagged morphemes, but even
then I don't think that could have produced such a high unique word count.
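
To make the counting argument concrete, here is a minimal sketch in Python.
It is only an illustration under stated assumptions: the token list is
hand-written romaji, not real MeCab output, the coarse "noun"/"particle" tags
are simplified, and merge_noun_runs() is a hypothetical helper, not part of
any analyser. It compares total and unique word counts when noun sequences
are left split and when consecutive nouns are joined into one string.

```python
# Hand-made toy data (an assumption for illustration, not real MeCab output):
# (surface form, coarse part-of-speech tag) pairs.
TOKENS = [
    ("kikanshi", "noun"), ("zensoku", "noun"),
    ("to", "particle"),
    ("zensoku", "noun"), ("chiryou", "noun"),
    ("to", "particle"),
    ("chiryou", "noun"), ("yaku", "noun"),
    ("to", "particle"),
    ("zensoku", "noun"), ("yaku", "noun"),
    ("to", "particle"),
    ("kikanshi", "noun"), ("zensoku", "noun"),
    ("chiryou", "noun"), ("yaku", "noun"),
]


def merge_noun_runs(tokens):
    """Join each run of consecutive noun-tagged tokens into one string."""
    merged = []
    run = []
    for surface, pos in tokens:
        if pos == "noun":
            run.append(surface)
            continue
        if run:
            merged.append("".join(run))
            run = []
        merged.append(surface)
    if run:  # flush a noun run that ends the input
        merged.append("".join(run))
    return merged


def total_and_unique(words):
    """Return (number of tokens, number of distinct tokens)."""
    return len(words), len(set(words))


split_words = [surface for surface, _ in TOKENS]
merged_words = merge_noun_runs(TOKENS)

split_total, split_unique = total_and_unique(split_words)
merged_total, merged_unique = total_and_unique(merged_words)

print("split :", split_total, "tokens,", split_unique, "unique")
print("merged:", merged_total, "tokens,", merged_unique, "unique")
```

On this toy input the split version has more tokens but fewer unique words
than the merged version, because the same few components recombine into
several different compounds. It says nothing about how large the effect
would be on a real corpus.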

Jim

-- 
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Treasurer: Hawthorn Rowing Club, Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


