Corpora: statistics in learner English

Fri Jan 18 10:24:44 UTC 2002

Hello,

I'm not a statistician but ...

On 17 Jan 2002 at 23:06, xiaotian guo wrote:

> Currently, I am comparing the frequency data of a learner corpus and native
> speaker corpus (they have approximately the same size) and have some
> statistical queries. For example: For the verb KEEP, I have got the frequency
> of each verb form in the two corpora as follows:
>
> Learner corpus
> keep 348 (88.5%)
> keeps 15 (3.8%)
> keeping 9 (2.3%)
> kept 21 (5.4%)
> Total 392 (100%)
>
> Native speaker corpus
> keep 99 (58.2%)
> keeps 14 (8.2%)
> keeping 32 (18.8%)
> kept 25 (14.7%)
> Total 170 (99.9%)
>
> According to the percentage each form takes in its respective corpus, I can
> easily see a large difference between the use of "keep" in the learner corpus and
> that in the native speaker corpus (88.5%:58.2%). But one problem to my
> interpretation is "Why do you think this difference (88.5%:58.2%) is
> significant and other differences are not?"

These frequencies are the highest of the set, so any statistical test on them will be
more reliable than on the other forms. BTW, what are you comparing: (normalized)
frequencies or within-corpus percentages? To my mind you really should do both.
Also, do you take account of multi-word expressions with KEEP (of which there are
quite a few)? And what kind of ultimate answer do you expect to get from comparing
the frequency profiles of a single lemma in two corpora? Often your methodology will
be linked to what and how much you want to be able to conclude at the end...
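
To make the two measures concrete, here is a minimal Python sketch of both, using the
counts from the quoted message. The corpus size is a made-up placeholder (the thread
only says the two corpora are roughly the same size), and the function names are mine.
Note that the learner counts as quoted add up to 393 rather than the stated 392, so the
sketch sums the forms itself instead of trusting the total.

```python
# Counts of each KEEP form, copied from the quoted message.
learner = {"keep": 348, "keeps": 15, "keeping": 9, "kept": 21}
native = {"keep": 99, "keeps": 14, "keeping": 32, "kept": 25}


def within_corpus_percentages(counts):
    """Share of each form among all tokens of the lemma, in percent."""
    total = sum(counts.values())
    return {form: 100.0 * n / total for form, n in counts.items()}


def per_million(count, corpus_size):
    """Frequency normalized to occurrences per million words."""
    return 1_000_000 * count / corpus_size


# HYPOTHETICAL corpus size in words: not given anywhere in the thread.
ASSUMED_CORPUS_SIZE = 200_000

for name, counts in (("learner", learner), ("native", native)):
    shares = within_corpus_percentages(counts)
    for form, n in counts.items():
        print(f"{name:8s} {form:8s} {n:4d}  {shares[form]:5.1f}%  "
              f"{per_million(n, ASSUMED_CORPUS_SIZE):7.1f} per million")
```

The percentages describe how the lemma is distributed over its forms inside each corpus;
the normalized figures describe how often each form occurs at all. With corpora of similar
size the two views can still diverge, because KEEP as a whole is much more frequent in the
learner data.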

> I would think there is no way to
> answer this question by means of some statistical help because it really
> depends on individual circumstances and it will be difficult if not impossible
> to give a demarcation to such kind of comparison. But to make sure about
> this point, I would like to raise this question to the list members.
>
> Someone suggested "chi-square" to me. But after some initial reading, I found
> it can only review the relationship between the observed frequency and
> expected frequency and it is based on null hypothesis. It can only tell me
> whether there is a significant difference as a whole rather than
> individually concerning the use of the different forms of KEEP in the two
> corpora. It seems it cannot answer the question I have: why do you think the
> use of the base form "keep" is significantly different?

Chi-square is often fallible, that is true, especially when you compare high-frequency
words, which almost always display significant differences. Years ago, I tried to follow
Adam Kilgarriff's suggestion to use a variation of the Mann-Whitney ranks test after
slicing my corpora into same-sized 'subcorpora' and then calculating frequencies from
all of them, ordering them by rank and conducting the test. However, in order to be
able to do this one needs sizeable learner & native corpora in the first place. More
details in:

Kilgarriff, A. 1996. "Comparing word frequencies across corpora: Why chi-square
doesn't work, and an improved LOB-Brown comparison." In Proceedings from
ALLC-ACH '96: 169-172.
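
For illustration only, a rough Python sketch of the slicing idea: a plain Mann-Whitney U
test on per-slice counts, not Kilgarriff's variation of it. The per-slice counts at the
bottom are invented placeholders, not real data, and the function names are mine. The
z score uses the normal approximation without a tie correction, which is crude for so
few slices.

```python
import math


def average_ranks(values):
    """Rank values from 1 upwards; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U for sample_a, z score from the normal approximation)."""
    n_a, n_b = len(sample_a), len(sample_b)
    ranks = average_ranks(list(sample_a) + list(sample_b))
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)  # no tie correction
    return u_a, (u_a - mean_u) / sd_u


# HYPOTHETICAL counts of one word form in five same-sized slices per corpus.
learner_slices = [40, 35, 38, 42, 36]
native_slices = [10, 12, 9, 11, 8]

u, z = mann_whitney_u(learner_slices, native_slices)
print(f"U = {u}, z = {z:.2f}")
```

The point of working on slices is that the test then looks at how consistently a form is
more frequent across the subcorpora, rather than at one pooled count per corpus.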

> problem I just raised and try to
> detect differences in two corpora as a whole, what is the best statistical
> method to use? Oakes pointed out the weakness of Chi-square in Statistics in
> Corpus Linguistics:
>
> The Chi-square test is used for the comparison of frequency data. Kilgarriff
> has shown that this test should be modified when working with corpus data,
> since the null hypothesis is always rejected when working with
> high-frequency words.
>
As above.
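
If you do want to run chi-square on your table anyway, the following Python sketch may
help with the "as a whole versus individually" worry. It computes the overall statistic
for the 2x4 table of quoted counts and also the Pearson residual of every cell, i.e.
(observed - expected) / sqrt(expected). The residuals are my addition, not something
proposed in this thread, and they do nothing about the high-frequency problem described
above. The function name is mine; 7.815 is the usual 5% critical value for 3 degrees of
freedom.

```python
import math

forms = ["keep", "keeps", "keeping", "kept"]
# Counts copied from the quoted message (rows: learner, native).
table = [
    [348, 15, 9, 21],
    [99, 14, 32, 25],
]


def chi_square_with_residuals(rows):
    """Return (chi-square statistic, degrees of freedom, Pearson residuals)."""
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    residuals = []
    for row, row_total in zip(rows, row_totals):
        residual_row = []
        for observed, col_total in zip(row, col_totals):
            # Expected count if form choice were independent of corpus.
            expected = row_total * col_total / grand_total
            chi2 += (observed - expected) ** 2 / expected
            residual_row.append((observed - expected) / math.sqrt(expected))
        residuals.append(residual_row)
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return chi2, df, residuals


CRITICAL_05_DF3 = 7.815  # 5% critical value of chi-square with 3 df

chi2, df, residuals = chi_square_with_residuals(table)
print(f"chi-square = {chi2:.2f} (df = {df}), exceeds 5% cutoff: {chi2 > CRITICAL_05_DF3}")
for corpus, residual_row in zip(("learner", "native"), residuals):
    for form, residual in zip(forms, residual_row):
        print(f"{corpus:8s} {form:8s} residual {residual:+.2f}")
```

Cells with large absolute residuals are the ones that contribute most to the overall
statistic, which gives at least a rough per-form reading of the result.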

BTW, at least in applied linguistics many scholars give up the idea of using precise
statistical metrics because the reliability of the "significance" of the results can often
be called into question: there are just so many variables involved (sample size,
topic comparability, author age, etc.). Many of us simply take the percentages and
frequencies and comment upon them.

Hope you find this helpful enough.

Przemek

=======================================
Dr Przemyslaw Kaszubski
t: +48 61 8293515
e: przemka at amu.edu.pl
w: http://elex.amu.edu.pl/ifa/staff/kaszubski.html

(ENGLISH) LEARNER CORPORA PAGE:
http://main.amu.edu.pl/~przemka

COMPREHENSIVE CORPORA BIBLIOGRAPHY:
http://main.amu.edu.pl/~przemka/welcome.html#Corpbibl

School of English
Adam Mickiewicz University
Al. Niepodleglosci 4
61-874 Poznan
t: +48 61 8293506
f: +48 61 8523103
w: http://elex.amu.edu.pl/ifa
=======================================