[Corpora-List] Comparing learner frequencies with native frequencies

Adam Kilgarriff adam at lexmasterclass.com
Tue Mar 7 06:38:32 UTC 2006


Dominic, Abdou,

First, the problem is analogous to collocate-finding, so the same range of
stats such as MI and log likelihood can be used.  As with collocate-finding,
there's a balance to be struck between pure, mathematical surprisingness,
and the fact that commoner phenomena are, all else being equal, more
interesting than rarer ones.  A not-too-technical survey is available at
http://www.kilgarriff.co.uk/Publications/1996-K-AISB.pdf 
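For concreteness, here's a minimal Python sketch of the log-likelihood
(G2) calculation for a single word; the function name and the example
counts are made up for illustration:

    import math

    def log_likelihood(count1, count2, total1, total2):
        """Dunning's log-likelihood (G2) for one word: count1
        occurrences in a corpus of total1 tokens vs. count2
        occurrences in a corpus of total2 tokens."""
        expected1 = total1 * (count1 + count2) / (total1 + total2)
        expected2 = total2 * (count1 + count2) / (total1 + total2)
        g2 = 0.0
        if count1 > 0:
            g2 += count1 * math.log(count1 / expected1)
        if count2 > 0:
            g2 += count2 * math.log(count2 / expected2)
        return 2.0 * g2

    # e.g. a word seen 150 times in 10M native tokens but
    # 400 times in 10M learner tokens
    print(log_likelihood(150, 400, 10_000_000, 10_000_000))

Ranking words by G2 rather than by a raw (or logged) frequency
difference is one principled answer to the normalisation question.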

Second, "burstiness" - words occurring frequently in particular documents
but not much otherwise. If you don't make provision for it, many of the
words thrown up will be 'topic' words used a lot in a few texts but not
interestingly different between the two text types.  There are plenty of
ways to address it: the survey above describes an "adjusted frequency"
metric, and I compared Brown and LOB using document counts and the
non-parametric Mann-Whitney test:
http://www.kilgarriff.co.uk/Publications/1996-K-CHumBergen-Chisq.txt 
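A rough sketch of the document-count idea, assuming each corpus is held
as a list of tokenised documents (scipy's mannwhitneyu does the test
itself; the function name is mine):

    from scipy.stats import mannwhitneyu

    def compare_word(word, docs_a, docs_b):
        """Mann-Whitney test on per-document relative frequencies
        of `word`; docs_a and docs_b are lists of token lists,
        one list per document."""
        freqs_a = [doc.count(word) / len(doc) for doc in docs_a]
        freqs_b = [doc.count(word) / len(doc) for doc in docs_b]
        return mannwhitneyu(freqs_a, freqs_b, alternative="two-sided")

Because each document contributes a single data point, a word that
bursts in a handful of texts no longer dominates the statistic.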

Where the docs are all different lengths, it's trickier; an elegant general
solution is given by 

P. Savicky, J. Hlavacova. Measures of Word Commonness. Journal of
Quantitative Linguistics, Vol. 9, No. 3, 2002, pp. 215-231. 

They (1) divide the corpus into same-length "pseudodocuments" and count the
document frequency of the term in each pseudodoc; (2) to avoid problems
caused by the arbitrary cuts between docs, they consider all possible start
and end points for the pseudodocs and average.  We're implementing the
approach for text-type comparison in the Sketch Engine
http://www.sketchengine.co.uk (and would be interested to use your data as a
test set).
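A naive Python sketch of that averaging step, as I read their method
(real implementations would use a closed form rather than looping over
all L possible cuts):

    def avg_pseudodoc_df(positions, L):
        """Average document frequency of a term over all L possible
        ways of cutting the corpus into pseudodocuments of length L.
        `positions` are the token offsets of the term's occurrences."""
        total = 0
        for shift in range(L):
            # which pseudodoc each occurrence falls into under this cut
            total += len({(p + shift) // L for p in positions})
        return total / L

    # a burst scores low, an even spread scores high
    print(avg_pseudodoc_df([10, 11, 12], L=100))    # about 1.0
    print(avg_pseudodoc_df([10, 210, 410], L=100))  # 3.0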

Third, "other differences between the subcorpora": unless the two corpora
are very well matched in all ways but the text-type distinction you are
interested in, what very often happens is that the stats identify some
different dimension of difference between the corpora and that aspect swamps
out the one you wanted to find.  LOB/Brown was a nice test set because the
corpora are carefully set up to be matched. Even so, non-linguistic US vs UK
differences like cricket vs baseball were nicely thrown up by the stats!

All the best,

Adam


-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
Behalf Of Dominic Glennon
Sent: 06 March 2006 11:09
To: corpora at lists.uib.no
Subject: [Corpora-List] Comparing learner frequencies with native
frequencies

Dear corporistas,

I'm trying to compare word frequencies in our native speaker corpus and
our learner corpus. Having normalised the frequencies in both corpora to
frequencies per 10 million words, I find that a simple subtraction still
heavily skews the results towards high-frequency words. I've tried taking
the log of both normalised frequencies before subtracting, to get around
the Zipfian nature of the word frequency distribution; this gives better
results, but is it well-motivated? I'd be grateful for any help you could
give me, or any pointers to previous work done in this area. Many thanks,

Dom

Dominic Glennon
Systems Manager
Cambridge University Press
01223 325595

Search the web's favourite learner dictionaries for free at Cambridge
Dictionaries Online:
<http://dictionary.cambridge.org>


