[Corpora-List] n-grams (follow-up question)
Dirk Ludtke
dludtke at pine.kuee.kyoto-u.ac.jp
Wed Aug 28 06:13:15 UTC 2002
A slightly related question:
I am wondering if anyone could point me to work on n-gram recurrence.
A word (or n-gram) occurs k times in a corpus of n words. What is the
probability that this word occurs again?
Especially for small k, this probability seems to depend not only on k
and n, but also on the ratio of words with low and high frequency.
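To make the question concrete, here is a minimal sketch of one way the quantity could be measured empirically. It is an illustration under assumptions, not an established method: it reads "occurs again" as "appears at least once in the rest of the corpus", it splits the corpus in two at a hypothetical `split` point, and the function name `recurrence_by_count` is made up for this example. For each count k seen in the first part, it reports the fraction of words with that count that show up again in the second part.

```python
from collections import Counter, defaultdict


def recurrence_by_count(tokens, split=0.5):
    """Estimate, for each count k, how often a word seen k times recurs.

    Assumption: "recurs" means the word appears at least once in the
    part of the corpus after the split point.
    """
    cut = int(len(tokens) * split)
    seen = Counter(tokens[:cut])   # counts k in the first part
    later = set(tokens[cut:])      # word types present in the second part

    totals = defaultdict(int)      # number of word types with count k
    recurred = defaultdict(int)    # how many of those appear again

    for word, k in seen.items():
        totals[k] += 1
        if word in later:
            recurred[k] += 1

    # Fraction of types with count k that recur, keyed by k.
    return {k: recurred[k] / totals[k] for k in sorted(totals)}


if __name__ == "__main__":
    corpus = "a a a b b c d e a b c x y z w v".split()
    for k, rate in recurrence_by_count(corpus).items():
        print(k, rate)
```

The same function works for n-grams if `tokens` is a list of n-gram tuples instead of single words. The estimate is tied to the chosen split and to the size of the second part, which is part of why a theoretical treatment would be preferable.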
Is there a nice way to approximate these probabilities, maybe with
probability distributions? Is there a mathematical theory?
Thank you.