[Corpora-List] Comparing learner frequencies with native frequencies

Rayson, Paul rayson at exchange.lancs.ac.uk
Tue Mar 7 10:20:12 UTC 2006


Hi Dominic,

To add a few more pointers to what Adam and Bayan have already posted:
I'd recommend the log-likelihood measure. There is arguably a case for
using Fisher's exact test when dealing with two corpora of very
different sizes, or when comparing very low-frequency words, but the
practical significance of those comparisons (i.e. what conclusions you
can draw anyway) should be considered ahead of statistical significance.

As Adam says, there are a variety of statistics that can be used. For a
comparison of chi-squared and log-likelihood see:

Rayson P., Berridge D. and Francis B. (2004). Extending the Cochran rule
for the comparison of word frequencies between corpora. In Volume II of
Purnelle G., Fairon C., Dister A. (eds.) Le poids des mots: Proceedings
of the 7th International Conference on Statistical analysis of textual
data (JADT 2004), Louvain-la-Neuve, Belgium, March 10-12, 2004, Presses
universitaires de Louvain, pp. 926-936.
http://www.comp.lancs.ac.uk/computing/users/paul/publications/rbf04_jadt.pdf

and for a description of the method used in Wmatrix, see:

Rayson, P. and Garside, R. (2000). Comparing corpora using frequency
profiling. In Proceedings of the workshop on Comparing Corpora, held in
conjunction with the 38th annual meeting of the Association for
Computational Linguistics (ACL 2000), 1-8 October 2000, Hong Kong, pp.
1-6.
http://www.comp.lancs.ac.uk/computing/users/paul/publications/rg_acl2000.pdf

and

http://ucrel.lancs.ac.uk/llwizard.html

The log-likelihood statistic is also used in Mike Scott's WordSmith
tools to find keywords:
http://www.lexically.net/wordsmith/
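
In case it is useful, here is a minimal sketch (in Python, with my own
variable names) of the log-likelihood calculation as described in the
papers above and implemented in the web wizard: the observed frequency
of a word in each corpus is compared against the expected frequencies
derived from the two corpora combined.

    import math

    def log_likelihood(freq_1, corpus_size_1, freq_2, corpus_size_2):
        """Log-likelihood (G2) for one word's frequency in two corpora.
        Higher values indicate a larger difference between the corpora
        for this word. Illustrative sketch only."""
        total = corpus_size_1 + corpus_size_2
        # Expected frequencies under the null hypothesis that the word
        # is (proportionally) equally frequent in both corpora.
        expected_1 = corpus_size_1 * (freq_1 + freq_2) / total
        expected_2 = corpus_size_2 * (freq_1 + freq_2) / total
        ll = 0.0
        for observed, expected in ((freq_1, expected_1), (freq_2, expected_2)):
            if observed > 0:  # a zero observed frequency contributes nothing
                ll += observed * math.log(observed / expected)
        return 2 * ll

    # Example: a word occurring 120 times in a 1-million-word learner
    # corpus and 500 times in a 10-million-word native corpus.
    print(log_likelihood(120, 1000000, 500, 10000000))

As a rough guide, a value above 3.84 corresponds to significance at
p < 0.05 (chi-squared distribution, 1 degree of freedom), although, as
noted above, practical significance matters more than the threshold.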

I agree with Adam about taking account of word 'burstiness'. Either you
need to incorporate range/dispersion in an adjusted frequency measure,
or examine dispersion by hand for the keywords you identify.
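
By way of illustration only (Adam may have had a different measure in
mind), here is a sketch of one common dispersion adjustment, Juilland's
D, computed over a word's frequencies in equal-sized parts of a corpus;
the adjusted ('usage') frequency is then the raw frequency scaled by D.

    import math

    def juilland_d(part_freqs):
        """Juilland's D for a word, given its frequency in each of n
        equal-sized parts of a corpus. D is 0 when all occurrences fall
        in a single part and approaches 1 as the word becomes evenly
        dispersed."""
        n = len(part_freqs)
        mean = sum(part_freqs) / n
        if mean == 0:
            return 0.0
        variance = sum((f - mean) ** 2 for f in part_freqs) / n
        cv = math.sqrt(variance) / mean   # coefficient of variation
        return 1 - cv / math.sqrt(n - 1)

    def adjusted_frequency(part_freqs):
        """Juilland's usage coefficient U: raw frequency scaled by D."""
        return sum(part_freqs) * juilland_d(part_freqs)

    # A 'bursty' word: 50 occurrences, all in one of five parts (D = 0).
    print(juilland_d([50, 0, 0, 0, 0]))
    # An evenly dispersed word with the same total frequency (D = 1).
    print(juilland_d([10, 10, 10, 10, 10]))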

Finally, in your case, you need to consider differences in spelling (due
to learner errors), which will affect any comparison you do, but perhaps
that is exactly what you are looking to find from such a comparison
anyway?

Regards,
Paul.

Dr. Paul Rayson
Director of UCREL
Computing Department, Infolab21, South Drive, Lancaster University,
Lancaster, LA1 4WA, UK.
Web: http://www.comp.lancs.ac.uk/computing/users/paul/
Tel: +44 1524 510357 Fax: +44 1524 510492


-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
Behalf Of Dominic Glennon
Sent: 06 March 2006 11:09
To: corpora at lists.uib.no
Subject: [Corpora-List] Comparing learner frequencies with native
frequencies

Dear corporistas,

I'm trying to compare word frequencies in our native speaker corpus and
our learner corpus. Having normalised the frequencies in both corpora to
frequencies per 10 million words, a simple subtraction still heavily
skews the results towards high-frequency words. I've tried taking the
log of both normalised frequencies before subtracting to get around the
Zipfian nature of word frequency distribution - this gives better
results, but is it well-motivated? I'd be grateful for any help you
could give me, or any pointers to previous work done in this area. Many
thanks,

Dom

Dominic Glennon
Systems Manager
Cambridge University Press
01223 325595

Search the web's favourite learner dictionaries for free at Cambridge
Dictionaries Online:
<http://dictionary.cambridge.org>
