[Corpora-List] Statistical tests for corpus studies
Adam Kilgarriff
adam.kilgarriff at itri.brighton.ac.uk
Thu May 8 15:19:28 UTC 2003
Rayson, Paul wrote:
>But there is a problem with the Mann-Whitney test: too many zeros in the slices, as your IJCL paper points out, Adam. For example, in the LOB and Brown comparison only words with a frequency of 30 or more (in the joint corpus) had few enough zeros for the test to be applicable. This means that 92% of the word types in the joint corpus were omitted from the comparison.
>
But if there isn't enough data we shouldn't be drawing any inferences,
so that seems right. A name or technical term that is used many
times, but in only one or two documents, is not a sound basis for any
inference. (Some thought has to be given to slice size and to how
the corpus is sliced up, since both affect the number of
non-zero values available for the test.)
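
To make the point about slices and zeros concrete, here is a rough sketch in Python. It is an illustration only, not the procedure from the paper: the function names (slice_counts, zero_fraction, mann_whitney_u) and the toy counts are made up for this example, and the U statistic is the textbook rank-sum form with midranks for ties.

```python
def slice_counts(tokens, word, slice_size):
    """Count occurrences of `word` in consecutive, equal-sized slices.

    A trailing partial slice is dropped so every slice has the same size.
    """
    n_slices = len(tokens) // slice_size
    return [
        tokens[i * slice_size:(i + 1) * slice_size].count(word)
        for i in range(n_slices)
    ]


def zero_fraction(counts):
    """Proportion of slices in which the word does not occur at all."""
    return counts.count(0) / len(counts)


def mann_whitney_u(xs, ys):
    """Mann-Whitney U for xs against ys, using midranks for tied values."""
    pooled = sorted(xs + ys)
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # 0-based positions i..j-1 hold ranks i+1..j; tied values share the mean.
        ranks[pooled[i]] = (i + 1 + j) / 2
        i = j
    rank_sum_x = sum(ranks[x] for x in xs)
    n_x = len(xs)
    return rank_sum_x - n_x * (n_x + 1) / 2


# Hypothetical per-slice counts for one word in two corpora (four slices each).
# Both corpora contain the word 12 times in total.
bursty = [0, 0, 0, 12]   # all occurrences fall in a single slice
steady = [3, 3, 3, 3]    # occurrences spread evenly across slices

print(zero_fraction(bursty))           # 0.75
print(mann_whitney_u(bursty, steady))  # 4.0, out of a possible 4 * 4 = 16
```

The raw totals are identical, but three of the four "bursty" slices are zeros, and the ranks show the word is typically rarer there. Changing slice_size changes how many zeros appear, which is the interaction mentioned above.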
A couple of people asked for an e-version of the 'Comparing Corpora' paper; see
http://www.itri.bton.ac.uk/~Adam.Kilgarriff/publications.html#2001
Adam
--
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff
ITRI, University of Brighton tel: (44) 1273 642919
Lewes Road, Brighton BN2 4GJ, UK fax: (44) 1273 642908
adam at itri.bton.ac.uk http://www.itri.bton.ac.uk/~Adam.Kilgarriff
and
Lexicography MasterClass Ltd.
71 Freshfield Road, Brighton BN2 0BL, UK tel: (44) 1273 705773
adam at lexmasterclass.com http://www.lexmasterclass.com
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%