[Corpora-List] fisher's exact test

Stefan Evert evert at IMS.Uni-Stuttgart.DE
Fri Nov 12 11:22:33 UTC 2004


Hi Leonoor, hi Ted,

one thing you have to be aware of is that you must use the rightFisher
measure in NSP to obtain p-values for Fisher's exact test (the
leftFisher values aren't directly meaningful in the context of
statistical hypothesis tests).
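
For reference, the right-sided p-value is the probability of observing
n11 or more cooccurrences under the hypergeometric distribution fixed
by the marginal frequencies. Here is a minimal Python sketch of that
definition. It is not the NSP code: the function name right_fisher is
made up for illustration, and n1p and np1 are assumed to be the row-1
and column-1 totals. It relies on exact integer arithmetic (math.comb,
Python 3.8+), so it is only practical for small tables.

```python
from math import comb

def right_fisher(n11, n1p, np1, n):
    """Right-sided Fisher p-value P(X >= n11), X ~ hypergeometric.

    Assumed notation: n1p = row-1 total, np1 = column-1 total,
    n = sample size. Hypothetical helper, not part of NSP.
    """
    k_max = min(n1p, np1)          # largest possible value of n11
    total = comb(n, np1)           # number of ways to fill column 1
    # comb() returns 0 when np1 - k exceeds n - n1p, so impossible
    # tables contribute nothing to the sum.
    tail = sum(comb(n1p, k) * comb(n - n1p, np1 - k)
               for k in range(n11, k_max + 1))
    return tail / total

# Table [[3, 1], [1, 3]]: n11 = 3, n1p = 4, np1 = 4, n = 8
print(right_fisher(3, 4, 4, 8))    # 17/70, roughly 0.243
```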

Fisher's test is known to be problematic for larger samples, especially
with skewed distributions, and it is traditionally only applied to
tables with very small numbers (as in Ted's example).  That said, the
NSP implementation (of rightFisher) uses a "naive" multiplicative
algorithm, which should give the most accurate results you can
normally hope to get. This leaves two possible problems:

a) The naive implementation can be excruciatingly slow for large
marginal frequencies (I typically have tables where n = 10^6 and np1
and n1p can be greater than 1000), especially since it's written in
pure Perl.

b) In such extreme cases, you might even get an underflow error, when
the computed p-values are below 10^{-260} or so (I've observed values
as small as 10^{-10000} for cooccurrence data!).

You might want to consider using the log-likelihood measure (ll)
instead, which gives a very good approximation to the exact p-values
of Fisher's test, is easy to compute, and is numerically stable.
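
To show how little is involved, here is a minimal Python sketch of the
log-likelihood statistic G2 = 2 * sum O * ln(O / E) for a 2x2 table,
with expected frequencies E computed from the marginals. The function
names are hypothetical and this is not NSP's ll code. The p-value is
taken from the chi-squared distribution with one degree of freedom,
which is a two-sided approximation, so it is not directly the same
quantity as the one-sided rightFisher value.

```python
from math import log, erfc, sqrt

def log_likelihood(n11, n12, n21, n22):
    """G2 statistic for a 2x2 contingency table (hypothetical helper)."""
    n = n11 + n12 + n21 + n22
    rows = (n11 + n12, n21 + n22)
    cols = (n11 + n21, n12 + n22)
    observed = ((n11, n12), (n21, n22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:                      # convention: 0 * ln(0) = 0
                e = rows[i] * cols[j] / n  # expected frequency
                g2 += o * log(o / e)
    return 2.0 * g2

def chi2_pvalue_df1(x):
    """Upper tail of the chi-squared distribution with 1 d.f."""
    return erfc(sqrt(x / 2.0))

g2 = log_likelihood(3, 1, 1, 3)
print(g2, chi2_pvalue_df1(g2))
```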

If you really want to use Fisher's test on large samples, there's an
implementation in my UCS toolkit (sorry for the shameless plug :o),
which uses statistical functions from R (www.r-project.org) and a Perl
wrapper to get accurate values even in extreme cases (at least I hope
it does, I haven't really smoke-tested it yet). It makes the same
assumption as NSP, though, that n11 < n22 (or rather, it even assumes
that n11 is small compared to n).  If you're interested, you can
download the UCS toolkit from

http://www.collocations.de/software.html

The installation isn't quite as simple as with NSP (since UCS has
additional requirements), but it is known to run on Linux, Mac OS X,
and experimentally in a Cygwin environment on Windows.  The good news
is that you can easily import NSP data sets for bigram data. :o)

Best wishes,
Stefan


> Will you send me the input you are giving to Fisher's Left test? I think
> that's the easiest way to figure things out!
>
> Cordially,
> Ted (of NSP :)
>
> On Thu, 11 Nov 2004, Beek L.J.van der wrote:
>
> >
> > Does anyone know a good Perl implementation of Fisher's Exact Test for
> > very skewed distributions (with frequencies ranging from 0 to 1000+)?
> >
> > I've tried the NSP-package (version 0.71), but it doesn't always give the
> > correct results. Has anyone noticed (or even better: fixed) this before?
> >
> > thanks,
> > Leonoor
> >
> > --
> > Leonoor van der Beek, vdbeek at let.rug.nl
> > http://odur.let.rug.nl/~vdbeek
> > Rijksuniversiteit Groningen, Informatiekunde
> > Pb 716, 9700 AS Groningen, The Netherlands
> > tel. +31.50.3635977, fax  +31.50.3636855
> >
> >
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>

--
______________________________________________________________________
Stefan Evert                                     purl.org/stefan.evert
http://www.collocations.de/                             schtepf at gmx.de


