[Corpora-List] fisher's exact test

ted pedersen tpederse at d.umn.edu
Fri Nov 12 02:02:54 UTC 2004


> Does anyone know a good Perl implementation of Fisher's Exact Test for
> very skewed distributions (with frequencies ranging from 0 to 1000+)?
>
> I've tried the NSP-package (version 0.71), but it doesn't always give the
> correct results. Has anyone noticed (or even better: fixed) this before?
>

NSP's exact tests work pretty well for skewed distributions. However, the
code generally assumes that the data comes from ngram counts, and that
assumption is what has led to the problem above.

In particular, if we have a 2x2 table representing the bigram counts:

n11 n12 | n1p
n21 n22 | n2p
---------
np1 np2   npp

n11 represents the number of times w1 and w2 occur together, n12
represents the number of bigrams where w1 is the first word and w2 is not,
etc. Typically n22 is very large, since it counts all the bigrams in the
sample whose first word is not w1 and whose second word is not w2. Of
course n11 is much smaller than the sample size, making the distribution
quite skewed.
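
To make the layout of the table concrete, here is a minimal Python sketch
that fills in the four cells from the joint count and the marginals. It is
an illustration only, not NSP code: the function name bigram_table and the
counts in the usage line are made up for the example.

```python
def bigram_table(n11, n1p, np1, npp):
    """Fill in a 2x2 bigram table from the joint count and the marginals.

    n11: times w1 is immediately followed by w2
    n1p: bigrams with w1 as the first word
    np1: bigrams with w2 as the second word
    npp: total number of bigrams in the sample
    """
    n12 = n1p - n11               # w1 first, second word is not w2
    n21 = np1 - n11               # w2 second, first word is not w1
    n22 = npp - n1p - np1 + n11   # first word not w1 and second word not w2
    if min(n11, n12, n21, n22) < 0:
        raise ValueError("counts are inconsistent with the marginals")
    return [[n11, n12], [n21, n22]]


# Hypothetical counts, chosen only to show how large n22 is relative to n11.
print(bigram_table(10, 60, 40, 100000))
```

With these made-up counts the result is [[10, 50], [30, 99910]], so n22 is
several orders of magnitude larger than n11.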

Now, in the case of this user, the data is more like this:

 10 2 | 12
  3 1 |  4
-------
 13 3   16

which of course isn't ngram count data. Here n22 < n11, and that
actually causes a problem for our exact test implementation! Implicit in
our code is the faulty assumption that people would only be using this
for collocations, and I'm glad to see I was wrong about that. :) But we
should make this limitation more clearly known, and better yet, we should
just fix it, which I think we will!

This is discussed in a bit more detail below:

http://groups.yahoo.com/group/ngram/messages/15
http://groups.yahoo.com/group/ngram/messages/17

Now in the case above, if the table is reorganized so that it is
(equivalently) shown as below, everything is fine. So NSP exact tests
(for now) require that n22 > n11.

  1  2 |  3
  3 10 | 13
-----------
  4 12   16
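
As a sanity check that the two layouts really are equivalent, here is a
small Python sketch of Fisher's exact test computed directly from the
hypergeometric distribution. It is not the NSP implementation, and the
function names are hypothetical. It assumes Python 3.8 or later for
math.comb, and it sums over every possible value of n11, so it places no
restriction on the relative sizes of n11 and n22.

```python
from math import comb


def hypergeom_prob(n11, n1p, np1, npp):
    """Probability of seeing n11 in the top-left cell given fixed marginals."""
    return comb(n1p, n11) * comb(npp - n1p, np1 - n11) / comb(npp, np1)


def fisher_exact(table):
    """Return (left, right, two_sided) p-values for a 2x2 table."""
    (n11, n12), (n21, n22) = table
    n1p = n11 + n12
    np1 = n11 + n21
    npp = n11 + n12 + n21 + n22

    # Range of n11 values that are possible with these marginals.
    lowest = max(0, n1p + np1 - npp)
    highest = min(n1p, np1)
    probs = {k: hypergeom_prob(k, n1p, np1, npp)
             for k in range(lowest, highest + 1)}

    p_obs = probs[n11]
    left = sum(p for k, p in probs.items() if k <= n11)
    right = sum(p for k, p in probs.items() if k >= n11)
    # Two-sided: all tables no more probable than the observed one
    # (small tolerance guards against floating-point ties).
    two_sided = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))
    return left, right, two_sided


original = [[10, 2], [3, 1]]
rearranged = [[1, 2], [3, 10]]
print(fisher_exact(original))
print(fisher_exact(rearranged))
```

Both calls give the same values, roughly 0.8643 (left), 0.6071 (right) and
1.0 (two-sided), which is what you would expect since rearranging the
table this way leaves the test unchanged.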

Cordially,
Ted

PS NSP turns 4 years old on November 30.  Big party in Duluth, you are
all invited. :)

--
Ted Pedersen
http://www.d.umn.edu/~tpederse


