[Corpora-List] Comparing learner frequencies with native frequencies

Bayan Shawar bshawar at yahoo.com
Tue Mar 7 09:29:20 UTC 2006


Dear Dominic,
     Paul Rayson at Lancaster University developed the
Wmatrix tool, which compares two files or corpora at
three levels: POS, semantic, and lexical. It accepts
files of unequal size and produces log-likelihood
scores.

Rayson, P. (2003). Matrix: a statistical method and
software tool for linguistic analysis through corpus
comparison. Ph.D. thesis. Lancaster University.
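For reference, the log-likelihood (G2) statistic that Wmatrix reports for a word in two corpora of different sizes can be sketched in a few lines of Python (the counts in the example are invented for illustration):

```python
import math

def log_likelihood(freq1, size1, freq2, size2):
    """Log-likelihood (G2) for one word's frequency in two
    corpora of possibly unequal size, as reported by Wmatrix."""
    # expected counts under the null hypothesis of equal usage
    expected1 = size1 * (freq1 + freq2) / (size1 + size2)
    expected2 = size2 * (freq1 + freq2) / (size1 + size2)
    g2 = 0.0
    if freq1 > 0:
        g2 += freq1 * math.log(freq1 / expected1)
    if freq2 > 0:
        g2 += freq2 * math.log(freq2 / expected2)
    return 2 * g2

# e.g. a word occurring 120 times in a 1M-word learner corpus
# vs 400 times in a 10M-word native corpus
print(round(log_likelihood(120, 1_000_000, 400, 10_000_000), 2))
```

The higher the score, the less plausible it is that the two relative frequencies arose by chance from the same underlying distribution.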

I have also used this tool for corpus comparison in my
own research:

Abu Shawar B. and Atwell E. 2005. A chatbot system as a
tool to animate a corpus. ICAME Journal. Vol. 29, pp.
5-23.

I hope this is useful,
Bayan Abu Shawar

--- Adam Kilgarriff <adam at lexmasterclass.com> wrote:

> Dominic, Abdou,
> 
> First, the problem is analogous to
> collocate-finding, so the same range of
> stats such as MI and log likelihood can be used.  As
> with collocate-finding,
> there's a balance to be struck between pure,
> mathematical surprisingness,
> and the fact that commoner phenomena are, all else
> being equal, more
> interesting than rarer ones.  Not-too-technical
> survey available at 
>
http://www.kilgarriff.co.uk/Publications/1996-K-AISB.pdf
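The trade-off Adam describes can be seen with a toy example (all counts invented): an MI-style score, being a pure ratio of relative frequencies, lets a very rare word top the list, while log-likelihood also weights raw frequency.

```python
import math

# (word, freq in corpus A, freq in corpus B); corpus sizes 1M each
N1 = N2 = 1_000_000
words = [("the", 60000, 58000), ("zygote", 12, 1), ("goal", 300, 90)]

def mi_score(f1, f2):
    # log2 ratio of relative frequencies (+1 smoothing on f2)
    return math.log2((f1 / N1) / ((f2 + 1) / N2))

def ll_score(f1, f2):
    # log-likelihood (G2), which also rewards sheer frequency
    e1 = N1 * (f1 + f2) / (N1 + N2)
    e2 = N2 * (f1 + f2) / (N1 + N2)
    score = 0.0
    if f1: score += f1 * math.log(f1 / e1)
    if f2: score += f2 * math.log(f2 / e2)
    return 2 * score

print(sorted(words, key=lambda w: -mi_score(w[1], w[2]))[0][0])  # rare word wins
print(sorted(words, key=lambda w: -ll_score(w[1], w[2]))[0][0])  # commoner word wins
```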
> 
> 
> Second, "burstiness" - words occurring frequently in
> particular documents
> but not much otherwise. If you don't make provision
> for it, many of the
> words thrown up will be 'topic' words used a lot in
> a few texts but not
> interestingly different between the two text types. 
> There are plenty of
> ways to address it; survey above describes an
> "adjusted frequency" metric, I
> compared Brown and LOB using document counts and the
> non-parametric
> Mann-Whitney test
>
http://www.kilgarriff.co.uk/Publications/1996-K-CHumBergen-Chisq.txt.
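The Mann-Whitney approach ranks per-document counts rather than comparing corpus totals, which makes it robust to burstiness. A minimal pure-Python sketch of the U statistic (the document counts below are invented):

```python
def mann_whitney_u(xs, ys):
    """Mann-Whitney U: a non-parametric, rank-based comparison of
    two samples -- here, per-document counts of a word in two
    corpora. Ties receive averaged ranks."""
    combined = sorted((v, i) for i, v in enumerate(xs + ys))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j + 1) / 2          # 1-based ranks i+1..j, averaged
        for k in range(i, j):
            ranks[combined[k][1]] = avg_rank
        i = j
    r1 = sum(ranks[: len(xs)])              # rank sum of the first sample
    return r1 - len(xs) * (len(xs) + 1) / 2

# per-document counts of a word in five documents per corpus;
# a small U means the first sample's counts tend to be lower
print(mann_whitney_u([0, 2, 1, 0, 3], [4, 5, 2, 6, 3]))
```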
> 
> 
> Where the docs are all different lengths, it's
> trickier; an elegant general
> solution is given by 
> 
> P. Savicky, J. Hlavacova. Measures of Word
> Commonness. Journal of
> Quantitative Linguistics, Vol. 9, 2003, No. 3, pp.
> 215-231. 
> 
> They (1) divide the corpus into same-length
> "pseudodocuments", and count the
> document frequency of the term in each pseudodoc;
> (2) to avoid problems
> caused by the arbitrary cuts between docs, they
> consider all possible start-
> and end-points for the pseudodocs, and average. 
> We're implementing the
> approach for text-type comparison in the Sketch
> Engine
> http://www.sketchengine.co.uk (and would be
> interested to use your data as a
> test set).
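The shape of the Savicky & Hlavacova measure can be sketched naively as below (this brute-force version loops over every offset, which is O(chunk x occurrences); the paper derives an efficient closed form):

```python
def avg_reduced_docfreq(positions, chunk):
    """Naive sketch of the pseudo-document idea: for each of the
    `chunk` possible segmentation offsets, cut the corpus into
    fixed-length pseudo-documents, count how many contain the
    word, and average over offsets.  `positions` holds the token
    offsets at which the word occurs."""
    total = 0
    for offset in range(chunk):
        docs = {(p + offset) // chunk for p in positions}  # pseudo-doc ids
        total += len(docs)
    return total / chunk

# a word clustered near the start of a text plus one later occurrence:
# clustering keeps the averaged document frequency close to 2, not 4
print(avg_reduced_docfreq([0, 1, 2, 100], 50))
```

A bursty word that occurs many times in one place thus scores much lower than a word with the same total frequency spread evenly through the corpus.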
> 
> Third, "other differences between the subcorpora":
> unless the two corpora
> are very well matched in all ways but the text-type
> distinction you are
> interested in, what very often happens is that the
> stats identify some
> different dimension of difference between the
> corpora and that aspect swamps
> out the one you wanted to find.  LOB/Brown was a
> nice test set because the
> corpora are carefully set up to be matched. Even so,
> non-linguistic US vs UK
> differences like cricket vs baseball were nicely
> thrown up by the stats!
> 
> All the best,
> 
> Adam
> 
> 
> -----Original Message-----
> From: owner-corpora at lists.uib.no
> [mailto:owner-corpora at lists.uib.no] On
> Behalf Of Dominic Glennon
> Sent: 06 March 2006 11:09
> To: corpora at lists.uib.no
> Subject: [Corpora-List] Comparing learner
> frequencies with native
> frequencies
> 
> Dear corporistas,
> 
> I'm trying to compare word frequencies in our native
> speaker corpus and
> our learner corpus. Having normalised the
> frequencies in both corpora to
> frequencies per 10 million words, a simple
> subtraction still heavily skews
> the results towards high-frequency words. I've tried
> taking the log of
> both normalised frequencies before subtracting to
> get around the Zipfian
> nature of word frequency distribution - this gives
> better results, but is
> it well-motivated? I'd be grateful for any help you
> could give me, or any
> pointers to previous work done in this area. Many
> thanks,
> 
> Dom
> 
> Dominic Glennon
> Systems Manager
> Cambridge University Press
> 01223 325595
> 
> Search the web's favourite learner dictionaries for
> free at Cambridge
> Dictionaries Online:
> <http://dictionary.cambridge.org>
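Dominic's normalise-then-subtract-logs idea amounts to scoring each word by the log of the ratio of its relative frequencies, an effect-size measure that is indeed a recognised complement to significance statistics like log-likelihood. A sketch, with an assumed +0.5 smoothing so unseen words do not produce infinities:

```python
import math

def log_ratio(freq_learner, size_learner, freq_native, size_native):
    """Log2 of the ratio of normalised frequencies (per 10M words).
    The +0.5 added to raw counts is one common smoothing choice,
    assumed here, so that zero counts stay finite."""
    per10m_l = (freq_learner + 0.5) / size_learner * 10_000_000
    per10m_n = (freq_native + 0.5) / size_native * 10_000_000
    return math.log2(per10m_l / per10m_n)

# a word relatively twice as common in the learner corpus scores ~ +1
print(round(log_ratio(200, 1_000_000, 1000, 10_000_000), 2))
```

Because the score is a ratio rather than a difference, high-frequency words no longer dominate the ranking, which addresses the skew Dominic describes.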
> 
> 
> 



		


