[Corpora-List] n-grams (follow-up question)

Dirk Ludtke dludtke at pine.kuee.kyoto-u.ac.jp
Wed Aug 28 06:13:15 UTC 2002


A slightly related question:

I am wondering if anyone could point me to work on n-gram reoccurrence.

A word (or n-gram) occurs k times in a corpus of n words. What is the
probability that this word occurs again?

Especially for small k, this probability seems to depend not only on k
and n, but also on the ratio of words with low and high frequency.
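
To make the quantity concrete, here is a minimal sketch of how it could be measured empirically on a held-out continuation of the corpus. It assumes that "occurs again" means "occurs at least once in the text that follows the first n words", which is only one possible reading; the function name reoccurrence_rates and the toy token list are made up for illustration.

```python
from collections import Counter, defaultdict

def reoccurrence_rates(tokens, n):
    """For each count k, return the fraction of word types seen exactly
    k times in tokens[:n] that appear at least once in tokens[n:]."""
    seen = Counter(tokens[:n])    # counts in the first n words
    later = set(tokens[n:])       # types present in the continuation
    total = defaultdict(int)      # k -> number of types with count k
    recurring = defaultdict(int)  # k -> how many of those reappear later
    for word, k in seen.items():
        total[k] += 1
        if word in later:
            recurring[k] += 1
    return {k: recurring[k] / total[k] for k in sorted(total)}

# Toy data (hypothetical): the first 6 tokens play the role of the corpus.
tokens = "a b a c a b d e a b f a".split()
rates = reoccurrence_rates(tokens, 6)
print(rates)
```

The rates obtained this way depend on the length of the continuation as well as on k and n, so they are a way of observing the effect rather than a model of it.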

Is there a nice way to approximate these probabilities, perhaps with
probability distributions? Is there a mathematical theory?

Thank you.


