Corpora: statistics in learner English

Rayson, Paul rayson at exchange.lancs.ac.uk
Fri Jan 18 12:03:19 UTC 2002


Dear Guo, Przemek,

I would suggest using the log-likelihood statistic (sometimes called the
likelihood ratio) as an alternative to chi-squared. It can be calculated even
for words with low frequencies or low expected values.

You can calculate the LL value for the lemma as well as for the variants. You
don't mention the size of the corpora, so I've assumed each one is a million
words. I think it's the ratio of the two corpus sizes that is important.

> >
> > Learner corpus
> > keep 348 (88.5%)
> > keeps 15 (3.8%)
> > keeping 9 (2.3%)
> > kept 21 (5.4%)
> > Total 392 (100%)
> >
> > Native speaker corpus
> > keep 99 (58.2%)
> > keeps 14 (8.2%)
> > keeping 32 (18.8%)
> > kept 25 (14.7%)
> > Total 170 (99.9%)
>
Here are the LL values, rounded to zero decimal places and calculated relative
to the corpus size rather than the lemma total:

Lemma KEEP: LL = 90
keep 147
keeps 0
keeping 14
kept 0

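This shows that keep is significantly overused and keeping is significantly
underused in the learner corpus relative to the native speaker corpus, while
keeps and kept show no significant difference. But of course the overuse of
the lemma as a whole is an important factor to consider in your studies.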
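
For anyone who wants to check the figures, here is a minimal sketch of the
calculation in Python. It assumes the usual two-corpus form of the statistic,
LL = 2 * sum(O_i * ln(O_i / E_i)) with E_i = N_i * (O_1 + O_2) / (N_1 + N_2),
and the same one-million-word corpus sizes I assumed above. The function and
variable names are my own for illustration and are not taken from any tool.

```python
import math


def log_likelihood(a, b, c, d):
    """LL for a word seen a times in a corpus of c words
    and b times in a corpus of d words."""
    # Expected frequencies if the word were equally common in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    total = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        # A zero observed frequency contributes nothing (0 * ln 0 is taken as 0).
        if observed > 0:
            total += observed * math.log(observed / expected)
    return 2 * total


# Frequencies quoted in the original message.
LEARNER = {"keep": 348, "keeps": 15, "keeping": 9, "kept": 21}
NATIVE = {"keep": 99, "keeps": 14, "keeping": 32, "kept": 25}

# ASSUMPTION: the corpus sizes were not given, so one million words each.
CORPUS_SIZE = 1_000_000


def ll_table(learner, native, size1, size2):
    """Return LL for the lemma as a whole and for each word form."""
    table = {
        "KEEP (lemma)": log_likelihood(
            sum(learner.values()), sum(native.values()), size1, size2
        )
    }
    for form in learner:
        table[form] = log_likelihood(learner[form], native[form], size1, size2)
    return table


if __name__ == "__main__":
    for form, ll in ll_table(LEARNER, NATIVE, CORPUS_SIZE, CORPUS_SIZE).items():
        print(f"{form:<14}{ll:6.0f}")
```

Because the expected values depend only on each corpus's share of the combined
size, changing both sizes by the same factor leaves LL unchanged. With one
degree of freedom, the critical value at p < 0.05 is 3.84.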

For more details on log-likelihood, see:

Rayson, P. and Garside, R. (2000). Comparing corpora using frequency profiling.
In Proceedings of the Workshop on Comparing Corpora, held in conjunction with
the 38th Annual Meeting of the Association for Computational Linguistics (ACL
2000), 1-8 October 2000, Hong Kong, pp. 1-6.
http://www.comp.lancs.ac.uk/computing/users/paul/publications/rg_acl2000.pdf

I also have an online LL calculator:
http://lingo.lancs.ac.uk/llwizard.html

Regards,
Paul.


