[Corpora-List] Comparing learner frequencies with native frequencies

Mon Mar 6 11:09:20 UTC 2006

Dear corporistas,

I'm trying to compare word frequencies in our native speaker corpus and
our learner corpus. Having normalised the frequencies in both corpora to
frequencies per 10 million words, I find that a simple subtraction still
heavily skews the results towards high-frequency words. I've tried taking
the log of both normalised frequencies before subtracting, to get around
the Zipfian nature of the word frequency distribution. This gives better
results, but is it well-motivated? I'd be grateful for any help you could
give me, or any pointers to previous work done in this area. Many thanks,

Dom
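
For concreteness, the comparison described in the message might be
sketched as follows in Python. This is only a minimal sketch under stated
assumptions: the function names, the example counts and the additive
smoothing constant are hypothetical and do not come from the original
post. Subtracting the logs of two normalised frequencies is the same as
taking the log of their ratio, so the score reflects relative rather than
absolute difference.

```python
import math

# Normalisation base mentioned in the post: frequency per 10 million words.
PER = 10_000_000


def normalise(count, corpus_size, per=PER):
    """Convert a raw count into a frequency per `per` words."""
    return count * per / corpus_size


def simple_difference(learner_count, learner_size, native_count, native_size):
    """Plain subtraction of normalised frequencies (learner minus native)."""
    return normalise(learner_count, learner_size) - normalise(native_count, native_size)


def log_difference(learner_count, learner_size, native_count, native_size,
                   smoothing=0.5):
    """Difference of log2 normalised frequencies (learner minus native).

    Equivalent to log2 of the ratio of the two normalised frequencies.
    `smoothing` is added to each raw count so that a word missing from one
    corpus does not lead to log(0); the default 0.5 is an illustrative
    assumption, not a recommendation from the post.
    """
    learner = normalise(learner_count + smoothing, learner_size)
    native = normalise(native_count + smoothing, native_size)
    return math.log2(learner) - math.log2(native)


if __name__ == "__main__":
    size = 1_000_000  # hypothetical size, used here for both corpora
    # Hypothetical counts: each word is twice as frequent in the learner corpus.
    for word, learner_count, native_count in [("common", 20_000, 10_000),
                                              ("rare", 20, 10)]:
        print(word,
              simple_difference(learner_count, size, native_count, size),
              round(log_difference(learner_count, size, native_count, size), 3))
```

With these made-up counts, both words get a log difference close to 1,
whereas plain subtraction gives the common word a score a thousand times
larger than the rare one.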

Dominic Glennon
Systems Manager
Cambridge University Press
01223 325595
