[Corpora-List] Statistical tests for corpus studies

Adam Kilgarriff adam.kilgarriff at itri.brighton.ac.uk
Thu May 8 15:19:28 UTC 2003


Rayson, Paul wrote:

>But there is a problem with the Mann-Whitney test of too many zeros in the slices, as your IJCL paper points out, Adam. For example, in the LOB and Brown comparison only words with a frequency of 30 or more (in the joint corpus) had few enough zeros for the test to be applicable. This means that 92% of the word types in the joint corpus were omitted from the comparison.
>
But if there isn't enough data we shouldn't be drawing any inferences,
so that seems right.  A name or technical term that is used many
times, but in only one or two documents, is not a good basis for any
inferences.  (Some thought has to be given to slice size, and to how
the corpus is sliced up, since both will affect the number of
non-zero values you get for the test.)
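
To make the point about slices and zeros concrete, here is a rough
sketch in Python.  It is only an illustration under assumptions: the
function names (slice_counts, zero_fraction, mann_whitney_u) and the
toy counts are made up for this example, and the code computes just
the U statistic with midranks for ties, not a significance level.

```python
def slice_counts(tokens, word, slice_size):
    """Count occurrences of `word` in consecutive, equal-sized slices.

    Any incomplete slice at the end of the token list is dropped.
    """
    n_slices = len(tokens) // slice_size
    return [
        tokens[i * slice_size:(i + 1) * slice_size].count(word)
        for i in range(n_slices)
    ]


def zero_fraction(counts):
    """Proportion of slices in which the word does not occur at all."""
    return sum(1 for c in counts if c == 0) / len(counts)


def mann_whitney_u(xs, ys):
    """Mann-Whitney U statistic for sample xs, using midranks for ties."""
    pooled = sorted(xs + ys)
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Tied values occupy ranks i+1 .. j; give each the average rank.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    rank_sum_x = sum(rank_of[x] for x in xs)
    return rank_sum_x - len(xs) * (len(xs) + 1) / 2


# Hypothetical per-slice counts for a name used 5 times in one document
# of corpus A and never in corpus B.
counts_a = [0, 0, 0, 5]
counts_b = [0, 0, 0, 0]

print(zero_fraction(counts_a + counts_b))  # 0.875
print(mann_whitney_u(counts_a, counts_b))  # 10.0
```

With four slices per corpus, U can range from 0 to 16, and a value of
8 means no difference in ranks.  Seven of the eight slices are tied at
zero, so a raw frequency of 5 against 0 moves U only from 8 to 10.
That is the sense in which such a word carries too little evidence.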

A couple of people asked for an e-version of the 'Comparing Corpora' paper; see

http://www.itri.bton.ac.uk/~Adam.Kilgarriff/publications.html#2001

Adam


--

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff
ITRI, University of Brighton                   tel: (44) 1273 642919
Lewes Road, Brighton BN2 4GJ, UK               fax: (44) 1273 642908
adam at itri.bton.ac.uk     http://www.itri.bton.ac.uk/~Adam.Kilgarriff
  and
Lexicography MasterClass Ltd.
71 Freshfield Road, Brighton BN2 0BL, UK       tel: (44) 1273 705773
adam at lexmasterclass.com                http://www.lexmasterclass.com
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


