[Corpora-List] Number of unique words in text for different languages

John F. Sowa sowa at bestweb.net
Thu Aug 12 13:47:46 UTC 2010


On 8/12/2010 9:17 AM, Jim Breen wrote:
> Japanese morphological analysers such as MeCab, Chasen, etc. tend to
> over-split so that what might be considered a single word in English or
> French may end up as two or three elements in MeCab's output.

Over-splitting would increase the total word count, but reduce
the count of unique words, because the same short elements recur
in many different compounds.  The huge number of unique words that
Emmanuel Prochasson found was probably the result of grouping
long kanji strings into a single so-called noun.

For example, English 'life insurance company employee' would count as
4 words, but the German 'Lebensversicherungsgesellschaftsangestellter'
would be counted as just one word.
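
To make the counting effect concrete, here is a rough sketch in Python.
The toy compound list and the function name count_tokens_and_types are
made up for illustration; they are not taken from MeCab, Chasen, or
anyone's actual corpus. It counts total words and unique words for the
same material, once with every compound split into its parts and once
with each compound kept as a single string.

```python
from collections import Counter
from itertools import product

# Hypothetical toy data: 3 x 1 x 3 = 9 compounds built from 7 components.
FIRST = ["life", "car", "health"]
LAST = ["company", "policy", "employee"]
COMPOUNDS = [(a, "insurance", b) for a, b in product(FIRST, LAST)]


def count_tokens_and_types(compounds, split):
    """Return (total words, unique words) for one segmentation choice."""
    if split:
        # Over-splitting style: every component is its own word.
        tokens = [part for compound in compounds for part in compound]
    else:
        # Compounding style: the whole string is one word.
        tokens = ["".join(compound) for compound in compounds]
    counts = Counter(tokens)
    return sum(counts.values()), len(counts)


split_counts = count_tokens_and_types(COMPOUNDS, split=True)
joined_counts = count_tokens_and_types(COMPOUNDS, split=False)
print("split: ", split_counts)   # (27, 7)
print("joined:", joined_counts)  # (9, 9)
```

With splitting, the total goes up (27 vs. 9) while the unique count goes
down (7 vs. 9). On a real corpus the gap in unique words would presumably
be far larger, since the number of possible compounds grows much faster
than the number of components.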

John Sowa


_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


