[Corpora-List] summary n-grams (follow-up question)

Dirk Ludtke dludtke at pine.kuee.kyoto-u.ac.jp
Thu Aug 29 08:55:26 UTC 2002


Thanks to the people who answered my question from yesterday. I was
extremely surprised by how quickly the first answers came in (I couldn't
even type that fast). Since some of the replies were not sent to the
list, I would like to post a summary.

The question was:
> A word (or n-gram) occurs k times in a corpus of n words.
> What is the probability that this word occurs again?

The replies are listed in the order I received them. At the end of this
email I write a bit more about what I want to do with these
probabilities.

----------------

Sven C. Martin suggested using Maximum Likelihood estimates with
discounting (smoothing) to give probability mass to unobserved events:

> The so-called Maximum Likelihood estimations of
> probabilities p(w|h), where w is a word and h is the
> (n-1)-tuple of predecessor words, are in fact relative
> frequencies p(w|h) = N(h,w)/N(h), where N(h,w) is the
> frequency of the n-tuple in some training corpus.

and pointed to

> Chapter 4 of F. Jelinek: "Statistical methods for
> speech recognition", MIT Press, Cambridge, MA, 1997

and

> H. Ney et al.: "Statistical language modeling
> using leaving-one-out" in S. Young and G.
> Bloothooft: "Corpus-based methods in language and
> speech processing", Kluwer, Dordrecht, 1997

It seems to me that the mentioned methods solve a more general problem.
My question would be answered by getting p(w|h), where w is the n-gram
I am interested in and h is empty.
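
In my case h is empty, so the estimate reduces to the relative frequency
k/n. A minimal sketch of my own (not from Sven's references; the discount
d and the vocabulary size are just assumed numbers) of the plain ML
estimate next to a simple absolute-discounting variant:

    from collections import Counter

    def ml_estimate(word, corpus):
        """Relative frequency k/n: the maximum likelihood estimate of p(word)."""
        counts = Counter(corpus)
        return counts[word] / len(corpus)

    def discounted_estimate(word, corpus, d=0.5, vocab_size=50000):
        """Absolute discounting: subtract d from every observed count and
        give the freed mass to word types that were never observed.
        vocab_size is an assumed total vocabulary size."""
        counts = Counter(corpus)
        n = len(corpus)
        unseen_types = vocab_size - len(counts)
        if counts[word] > 0:
            return (counts[word] - d) / n
        # the freed mass d * |observed types| is shared among unseen types
        return d * len(counts) / (n * unseen_types)

    corpus = "the cat sat on the mat and the dog sat down".split()
    print(ml_estimate("the", corpus))          # 3/11
    print(discounted_estimate("the", corpus))  # (3 - 0.5)/11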

----------------

Stefan Th. Gries wrote

> Kenneth W. Church's paper called "Empirical estimates
> of adaptation: The chance of two Noriegas is closer to
> p/2 than p^2"

(Kenneth W. Church also replied later; see below.)

----------------

Oliver wrote

> It sounds a bit like something I read recently in Geoff
> Sampson's "Empirical Linguistics", where he describes
> the Good-Turing method for estimating the probabilities
> of events that haven't occurred yet. Along the way, this
> also gives corrected probabilities for things that
> occurred only once, since their true expected frequency
> may be only a fraction of one occurrence, which is not
> directly observable.
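
For what it is worth, my rough (and simplified) understanding of the
basic Good-Turing idea is that the adjusted count of something seen r
times is r* = (r+1) * N_{r+1} / N_r, where N_r is the number of types
seen exactly r times, and the mass N_1/N is reserved for unseen events:

    from collections import Counter

    def good_turing_adjusted_counts(samples):
        counts = Counter(samples)                 # type -> frequency r
        freq_of_freq = Counter(counts.values())   # r -> N_r
        adjusted = {}
        for item, r in counts.items():
            n_r = freq_of_freq[r]
            n_r1 = freq_of_freq.get(r + 1, 0)
            # fall back to the raw count when N_{r+1} is zero (sparse data)
            adjusted[item] = (r + 1) * n_r1 / n_r if n_r1 else r
        # probability mass reserved for unseen events: N_1 / N
        p_unseen = freq_of_freq.get(1, 0) / len(samples)
        return adjusted, p_unseen

    words = "a a a b b c c d e f g".split()
    adjusted, p_unseen = good_turing_adjusted_counts(words)
    print(adjusted, p_unseen)

Dividing the adjusted count by n gives the discounted probability. As far
as I understand, real implementations first smooth the N_r counts
themselves (the "simple Good-Turing" procedure described by Gale and
Sampson).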

----------------

Ken Church points to two of his papers, which are available on his web
page http://www.research.att.com/~kwc/

> (1) Poisson Mixtures, and
> (2) Empirical Estimates of Adaptation: The chance
> of two Noriegas is closer to p/2 than p^2

He also links to a paper by Ronald Rosenfeld about adaptation:
> http://citeseer.nj.nec.com/rosenfeld96maximum.html
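
As far as I understand the titles (I have not read the papers yet),
adaptation means that a word which has already occurred in a document is
much more likely to occur in it again than its overall frequency
suggests. A rough sketch of how one might measure this, assuming each
document is simply split into two halves (my assumption, not necessarily
the setup used in the papers):

    # Compare p(w in 2nd half of a document) with
    # p(w in 2nd half | w in 1st half).
    def adaptation(word, documents):
        """documents: list of token lists. Returns (prior, adapted)."""
        in_second = in_first = in_both = 0
        for doc in documents:
            half = len(doc) // 2
            first, second = set(doc[:half]), set(doc[half:])
            if word in second:
                in_second += 1
            if word in first:
                in_first += 1
                if word in second:
                    in_both += 1
        prior = in_second / len(documents)
        adapted = in_both / in_first if in_first else 0.0
        return prior, adapted

    docs = ["noriega was arrested today noriega faces charges".split(),
            "the weather today was fine and sunny".split()]
    print(adaptation("noriega", docs))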

-----------------

Thank you very much again. I will have enough material to read for at
least the next few days :)

Maybe I should also write a bit about the application. I want to use
these probabilities for examining the quality of different language
patterns in classification problems. A pattern could be an n-gram, but
also a combination of words with POS tags or text-format information.

An easy example: I want to decide whether a word is a noun or not. I
have a POS-tagged corpus and extract how often different patterns (like
n-grams with different n) led to nouns and how often they did not. The
question is which particular patterns are better than others.

As a score for the patterns, I am using the information gain (entropy).
But I have the feeling that it is not enough to estimate the
probabilities of the classes; I also need the probability of the
pattern itself.
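
For concreteness, this is roughly what I mean by the score of a single
binary pattern; the counts in the example at the bottom are invented:

    import math

    def entropy(pos, neg):
        """Binary entropy of a pos/neg count pair."""
        total = pos + neg
        if total == 0 or pos == 0 or neg == 0:
            return 0.0
        p = pos / total
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    def information_gain(match_noun, match_other, rest_noun, rest_other):
        """match_*: counts where the pattern fires, rest_*: where it does not."""
        n = match_noun + match_other + rest_noun + rest_other
        h_before = entropy(match_noun + rest_noun, match_other + rest_other)
        h_after = ((match_noun + match_other) / n * entropy(match_noun, match_other)
                   + (rest_noun + rest_other) / n * entropy(rest_noun, rest_other))
        return h_before - h_after

    # e.g. a pattern that fires 40 times (35 nouns, 5 other) and does not
    # fire 960 times (215 nouns, 745 other) -- invented numbers
    print(information_gain(35, 5, 215, 745))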

Big patterns (like 5-grams) tend to get good probability estimates, even
if we have seen them only once or twice. This leads to a high
information gain, completely ignoring that we are unlikely to see these
patterns again. A 5-gram that occurred 2 times has a much lower
probability of occurring again than a 2-gram that also occurred 2 times.
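
To illustrate the last point, the sketch below compares the Good-Turing
adjusted count of items seen exactly twice for 2-grams and for 5-grams
extracted from the same token stream ("corpus.txt" is just a
placeholder). On a real corpus I would expect N_3/N_2 to be much smaller
for 5-grams, so their count of 2 gets discounted much more heavily:

    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def adjusted_count_for_twice_seen(tokens, n):
        counts = Counter(ngrams(tokens, n))
        freq_of_freq = Counter(counts.values())
        n2, n3 = freq_of_freq.get(2, 0), freq_of_freq.get(3, 0)
        # Good-Turing: r* = (r+1) * N_{r+1} / N_r, here with r = 2
        return 3 * n3 / n2 if n2 else None

    tokens = open("corpus.txt").read().split()   # any plain-text corpus
    print("2-grams:", adjusted_count_for_twice_seen(tokens, 2))
    print("5-grams:", adjusted_count_for_twice_seen(tokens, 5))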

Dirk Ludtke


