[Corpora-List] Comparing learner frequencies with native frequencies
Bayan Shawar
bshawar at yahoo.com
Tue Mar 7 09:29:20 UTC 2006
Dear Dominic,
Paul Rayson, at Lancaster University, developed the
Wmatrix tool, which compares two files (corpora) at
three levels: part-of-speech, semantic, and lexical.
It accepts files of unequal size and reports
log-likelihood scores.
Rayson, P. (2003). Matrix: a statistical method and
software tool for linguistic analysis through corpus
comparison. Ph.D. thesis. Lancaster University.
I also used this tool in my own research for corpus
comparison:
Abu Shawar, B. and Atwell, E. (2005). A chatbot system
as a tool to animate a corpus. ICAME Journal, Vol. 29,
pp. 5-23.
Hopefully this is useful,
Bayan Abu Shawar
--- Adam Kilgarriff <adam at lexmasterclass.com> wrote:
> Dominic, Abdou,
>
> First, the problem is analogous to
> collocate-finding, so the same range of
> stats such as MI and log likelihood can be used. As
> with collocate-finding,
> there's a balance to be struck between pure,
> mathematical surprisingness,
> and the fact that commoner phenomena are, all else
> being equal, more
> interesting than rarer ones. Not-too-technical
> survey available at
>
> http://www.kilgarriff.co.uk/Publications/1996-K-AISB.pdf
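
A minimal sketch in Python of the two statistics mentioned above, using the
usual two-by-two setup (the word's count and the total token count in each
corpus). The MI score here is one simple observed-over-expected variant, and
the function name and toy counts are only illustrative:

    import math

    def ll_and_mi(a, b, c, d):
        """Compare one word across two corpora.
        a, b: occurrences of the word in corpus 1 and corpus 2
        c, d: total tokens in corpus 1 and corpus 2
        Returns (log-likelihood, MI-style score for corpus 1).
        """
        e1 = c * (a + b) / (c + d)   # expected count in corpus 1
        e2 = d * (a + b) / (c + d)   # expected count in corpus 2
        ll = 2 * ((a * math.log(a / e1) if a else 0.0) +
                  (b * math.log(b / e2) if b else 0.0))
        mi = math.log2(a / e1) if a else float("-inf")
        return ll, mi

    # A rare but strongly skewed word (50 vs 5) beats a common, mildly
    # skewed word (5000 vs 2500) on the MI-style score, but loses badly
    # on log-likelihood -- the trade-off described above.
    print(ll_and_mi(50, 5, 1_000_000, 1_000_000))
    print(ll_and_mi(5000, 2500, 1_000_000, 1_000_000))

The expected counts are computed from the combined frequency, so the same
formula works when the two corpora are of unequal size.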
>
>
> Second, "burstiness" - words occurring frequently in
> particular documents
> but not much otherwise. If you don't make provision
> for it, many of the
> words thrown up will be 'topic' words used a lot in
> a few texts but not
> interestingly different between the two text types.
> There are plenty of
> ways to address it; the survey above describes an
> "adjusted frequency" metric, and I compared Brown
> and LOB using document counts and the
> non-parametric Mann-Whitney test
>
> http://www.kilgarriff.co.uk/Publications/1996-K-CHumBergen-Chisq.txt.
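
A sketch of that document-count idea using SciPy's Mann-Whitney
implementation (assumed installed); it assumes same-length documents so the
per-document counts are directly comparable, and the counts are invented:

    from scipy.stats import mannwhitneyu

    # Per-document counts of one word in each corpus (toy numbers).
    counts_corpus1 = [0, 0, 1, 0, 2, 0, 0, 1, 0, 0]
    counts_corpus2 = [3, 0, 0, 5, 0, 0, 4, 0, 0, 6]

    # The test ranks all documents together and asks whether documents
    # from one corpus tend to outrank the other, so a few "bursty"
    # documents cannot dominate the way raw totals can.
    stat, p = mannwhitneyu(counts_corpus1, counts_corpus2,
                           alternative="two-sided")
    print(stat, p)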
>
>
> Where the docs are all different lengths, it's
> trickier; an elegant general
> solution is given by
>
> P. Savicky, J. Hlavacova. Measures of Word
> Commonness. Journal of
> Quantitative Linguistics, Vol. 9, 2003, No. 3, pp.
> 215-231.
>
> They (1) divide the corpus into same-length
> "pseudodocuments", and count the
> document frequency of the term in each pseudodoc;
> (2) to avoid problems
> cuased by the arbitrary cuts between docs, they
> consider all possible start-
> and end-points for the pseudodocs, and average.
> We're implementing the
> approach for text-type comparison in the Sketch
> Engine
> http://www.sketchengine.co.uk (and would be
> interested to use your data as a
> test set).
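
A brute-force sketch of the pseudodocument idea as described above; the
paper itself gives a more elegant general treatment, so this is only an
illustration, and the toy corpus and window length are invented:

    def averaged_pseudodoc_frequency(tokens, word, doc_len):
        """Average, over all cut offsets, of how many length-doc_len
        pseudodocuments contain `word` at least once."""
        n = len(tokens)
        total = 0
        for offset in range(doc_len):                 # every possible set of cuts
            hits = 0
            for start in range(offset, n, doc_len):   # chop the corpus here
                if word in tokens[start:start + doc_len]:
                    hits += 1
            total += hits
        return total / doc_len

    tokens = "the cat sat on the mat and the dog sat on the cat".split()
    print(averaged_pseudodoc_frequency(tokens, "cat", doc_len=4))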
>
> Third, "other differences between the subcorpora":
> unless the two corpora
> are very well matched in all ways but the text-type
> distinction you are
> interested in, what very often happens is that the
> stats identify some
> different dimension of difference between the
> corpora and that aspect swamps
> out the one you wanted to find. LOB/Brown was a
> nice test set because the
> corpora were carefully set up to be matched. Even so,
> non-linguistic US vs UK
> differences like cricket vs baseball were nicely
> thrown up by the stats!
>
> All the best,
>
> Adam
>
>
> -----Original Message-----
> From: owner-corpora at lists.uib.no
> [mailto:owner-corpora at lists.uib.no] On
> Behalf Of Dominic Glennon
> Sent: 06 March 2006 11:09
> To: corpora at lists.uib.no
> Subject: [Corpora-List] Comparing learner
> frequencies with native
> frequencies
>
> Dear corporistas,
>
> I'm trying to compare word frequencies in our native
> speaker corpus and
> our learner corpus. Having normalised the
> frequencies in both corpora to frequencies per
> 10 million words, I find that a simple subtraction
> still heavily skews the results towards
> high-frequency words. I've tried
> taking the log of
> both normalised frequencies before subtracting to
> get around the Zipfian
> nature of word frequency distribution - this gives
> better results, but is
> it well-motivated? I'd be grateful for any help you
> could give me, or any
> pointers to previous work done in this area. Many
> thanks,
>
> Dom
>
> Dominic Glennon
> Systems Manager
> Cambridge University Press
> 01223 325595
>
> Search the web's favourite learner dictionaries for
> free at Cambridge
> Dictionaries Online:
> <http://dictionary.cambridge.org>
>
>
>
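
For reference, a minimal sketch of the comparison Dominic describes above:
normalising both counts to a per-10-million-word rate and subtracting their
logs, which amounts to the log of the frequency ratio. The smoothing constant
is an added assumption (to keep unseen words finite), and the counts in the
example are invented:

    import math

    def log_freq_difference(count_a, size_a, count_b, size_b, per=10_000_000):
        """Difference of log normalised frequencies = log frequency ratio."""
        norm_a = count_a * per / size_a   # occurrences per 10M words, corpus A
        norm_b = count_b * per / size_b   # occurrences per 10M words, corpus B
        # Add a small constant so words unseen in one corpus stay finite.
        return math.log2(norm_a + 0.5) - math.log2(norm_b + 0.5)

    # e.g. 120 hits in a 5M-word corpus vs 30 hits in a 12M-word corpus:
    print(log_freq_difference(120, 5_000_000, 30, 12_000_000))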