[Corpora-List] Mann-Whitney ranks test

Mon Nov 22 22:15:01 UTC 2004

Hi, everyone:

Can anyone advise me as to the use of the Mann-Whitney ranks test to
determine lexical differences between a homogeneous collection (such as
about 260,000 words of a single author) and a heterogeneous corpus (such
as the fiction subcorpora of the Brown Corpus)? Alternatively, can
anyone point me to a good resource that discusses the issue?
I had thought about splitting each corpus into segments of about 20,000
words and then running Mann-Whitney tests against lexical items of
interest (body parts in particular).  After having read Adam Kilgarriff
("Comparing Corpora", 2001, and others), I know that with heterogeneous
corpora the Mann-Whitney test goes some way towards countering the ease
with which high-frequency words lead to rejection of the null
hypothesis, an ease that makes hypothesis testing with chi-square or
log-likelihood inappropriate.  If a homogeneous corpus is split into
20,000-word
adjacent segments for the Mann-Whitney, isn't it likely that the
bunchiness characteristic will still be present in the homogeneous
samples?  And, furthermore, is it appropriate to use statistical tables
to test the null hypothesis given that the samples from the homogeneous
corpus are all from the same author while the heterogeneous samples are
from different authors, on average about 10 different ones per
20,000-word sample?
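
For concreteness, the procedure described above might be sketched
roughly as follows. This is only an illustration, not a recommended
analysis: the function names segment_counts and mann_whitney_u are
made up for this sketch, tokenization is assumed to have been done
already, and the p-value uses the large-sample normal approximation
(with tie correction, without continuity correction) rather than
exact statistical tables.

```python
from itertools import chain
from math import erfc, sqrt


def segment_counts(tokens, targets, size=20000):
    """Count target-word hits in each full, adjacent segment of `size` tokens.

    A trailing partial segment is dropped (an assumption of this sketch).
    """
    targets = {t.lower() for t in targets}
    counts = []
    for start in range(0, len(tokens) - size + 1, size):
        segment = tokens[start:start + size]
        counts.append(sum(1 for w in segment if w.lower() in targets))
    return counts


def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided p from the normal approximation)."""
    # Pool both samples, remembering which sample each value came from.
    pooled = sorted(chain(((v, 0) for v in x), ((v, 1) for v in y)))
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:  # every value tied: no evidence of a difference
        return u, 1.0
    z = (u - mean) / sqrt(var)
    return u, erfc(abs(z) / sqrt(2))


# Hypothetical usage: author_tokens and brown_fiction_tokens are
# placeholder token lists, and the word list is only an example.
# body_parts = ["hand", "eye", "head", "arm"]
# u, p = mann_whitney_u(segment_counts(author_tokens, body_parts),
#                       segment_counts(brown_fiction_tokens, body_parts))
```

Note that 260,000 words yield only 13 segments of 20,000 words, so
the normal approximation is rough at that size and exact tables may
be preferable. The sketch also does nothing about the independence
question raised above: the test assumes independent observations,
which adjacent segments of one author's text may not be.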

Many thanks,

Don

--

Don.Hardy at Colostate.edu
http://textant.colostate.edu