Corpora: statistics in CL question

Fri Mar 31 20:35:23 UTC 2000

"Alexander S. Yeh" wrote:

> >In most studies of z-scores and t-scores in computational linguistics,
> >you tend to find that scores are too high.  When you compute scores
> >for bigrams, for example, you would expect 5% of the scores would be
> >greater than 1.65, but you tend to find more than that.

Thanks to Kenneth Church, Ted Dunning, Wessel Kraaij and Mitch Marcus for
responding to my query on and outside of this list.

The two basic types of explanation that I received were:

1. Often in natural language, rare events happen much more often than a
Gaussian (normal) distribution predicts: the tails of the distribution
carry much more mass than the tails of a Gaussian.

2. The tests assume independent samples. Often, this is not true in
natural language processing. For example, a content word that appears
in a document makes that same word more likely to appear again later
in that document.
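
The effect in explanation 2 can be illustrated with a small simulation.
This is only a sketch under assumed numbers, not anything the respondents
supplied: the overall word rate (5%), the document length (50 tokens), the
"boost" factor (5) and all function names are hypothetical choices made
for illustration. Both samplers have the same expected count, but the
bursty one concentrates the word in a few documents, so a z-score computed
under the independence assumption should exceed 1.65 far more often than
the nominal 5%.

```python
import math
import random


def z_score(count, n, p):
    """z-score of an observed count, assuming n independent draws at rate p."""
    return (count - n * p) / math.sqrt(n * p * (1 - p))


def independent_count(rng, n, p):
    """Count occurrences when every token is an independent draw at rate p."""
    return sum(rng.random() < p for _ in range(n))


def bursty_count(rng, n_docs, doc_len, p, boost):
    """Count occurrences when the word clusters inside documents.

    Assumed toy model: a document contains the word at rate p * boost with
    probability 1 / boost, and never otherwise, so the overall rate is
    still p. Requires p * boost <= 1.
    """
    count = 0
    for _ in range(n_docs):
        if rng.random() < 1.0 / boost:
            rate = p * boost
            count += sum(rng.random() < rate for _ in range(doc_len))
    return count


def tail_fraction(zs, threshold=1.65):
    """Fraction of z-scores above the one-sided 5% cutoff."""
    return sum(z > threshold for z in zs) / len(zs)


def simulate(trials=1000, n_docs=20, doc_len=50, p=0.05, boost=5, seed=0):
    """Return (independent, bursty) fractions of z-scores above 1.65."""
    rng = random.Random(seed)
    n = n_docs * doc_len
    independent = [
        z_score(independent_count(rng, n, p), n, p) for _ in range(trials)
    ]
    bursty = [
        z_score(bursty_count(rng, n_docs, doc_len, p, boost), n, p)
        for _ in range(trials)
    ]
    return tail_fraction(independent), tail_fraction(bursty)


if __name__ == "__main__":
    ind, bur = simulate()
    print(f"independent samples: {ind:.3f} of z-scores > 1.65")
    print(f"bursty samples:      {bur:.3f} of z-scores > 1.65")
```

The clustering leaves the mean count unchanged and only inflates its
variance, which is the quantity the z-score formula underestimates when it
assumes independence.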

-Alex Yeh