[Corpora-List] Number of unique words in text for different languages
John F. Sowa
sowa at bestweb.net
Thu Aug 12 13:47:46 UTC 2010
On 8/12/2010 9:17 AM, Jim Breen wrote:
> Japanese morphological analysers such as MeCab, Chasen, etc. tend to
> over-split so that what might be considered a single word in English or
> French may end up as two or three elements in MeCab's output.
Over-splitting would increase the total word count, but reduce
the count of unique words, because the same short elements recur
across many different compounds. The huge number of unique words
that Emmanuel Prochasson found was probably the result of grouping
long Kanji strings into a single so-called noun.
For example, English 'life insurance company employee' would count as
4 words, but the German 'Lebensversicherungsgesellschaftsangestellter'
would be counted as just one word.
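
The effect on the two counts can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the element list, the artificial compounds built from ordered pairs of elements, and the helper name count_tokens_and_types. It is not the output of MeCab, Chasen, or any real analyser.

```python
from itertools import permutations

def count_tokens_and_types(tokens):
    """Return (total word count, unique word count) for a list of tokens."""
    return len(tokens), len(set(tokens))

# Hypothetical inventory of compound elements (illustration only).
elements = ["life", "insurance", "company", "employee"]

# Build artificial two-element compounds from every ordered pair.
compounds = list(permutations(elements, 2))

# Grouped analysis: each compound is one token, as with a long German noun.
unsplit = ["".join(parts) for parts in compounds]

# Split analysis: each element is its own token, as in the English phrase.
split = [part for parts in compounds for part in parts]

print(count_tokens_and_types(unsplit))  # (12, 12)
print(count_tokens_and_types(split))    # (24, 4)
```

In this toy case, splitting doubles the total word count while the unique word count drops from 12 to 4. Real corpora will differ in the numbers, but the direction of the effect is the same whenever compounds share elements.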
John Sowa