Corpora: statistics in learner English

xiaotian guo xiaotiang at hotmail.com
Thu Jan 17 15:06:10 UTC 2002


Dear All

First, let me thank all those who replied to my request about
"overuse and underuse of learner English" a couple of weeks ago.

Currently, I am comparing frequency data from a learner corpus and a native
speaker corpus (the two are approximately the same size) and have some
statistical queries. For example, for the verb KEEP, I have the frequency
of each verb form in the two corpora as follows:

Learner corpus
keep 348 (88.5%)
keeps 15 (3.8%)
keeping 9 (2.3%)
kept 21 (5.4%)
Total 392 (100%)

Native speaker corpus
keep 99 (58.2%)
keeps 14 (8.2%)
keeping 32 (18.8%)
kept 25 (14.7%)
Total 170 (99.9%)

Judging by the percentage each form takes in its respective corpus, I can
easily see a large difference between the use of "keep" in the learner corpus
and in the native speaker corpus (88.5% vs. 58.2%). But one objection to my
interpretation is: "Why do you think this difference (88.5% vs. 58.2%) is
significant and the other differences are not?" I would think there is no way
to answer this question with the help of statistics, because it really
depends on individual circumstances, and it would be difficult if not
impossible to draw a line for this kind of comparison. But to make sure about
this point, I would like to put the question to the list members.

Someone suggested the chi-square test to me. But after some initial reading,
I found that it only reveals the relationship between the observed
frequencies and the expected frequencies, and that it is based on the null
hypothesis. It can only tell me whether there is a significant difference
overall, not individually for each of the forms of KEEP in the two corpora.
So it seems unable to answer my question: why do you think the use of the
base form "keep" is significantly different?
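
To make concrete what the test computes on the figures above, here is a minimal Python sketch. It uses the counts exactly as listed; note that the learner counts as listed sum to 393 rather than the stated 392, so the totals are recomputed from the counts. The function name and the per-cell adjusted standardized residuals are assumptions added for illustration. Residuals beyond roughly +/-1.96 are conventionally read as cells that depart from the null hypothesis, with no correction for multiple comparisons, so this is a sketch of one possible follow-up and not a definitive answer to the question.

```python
import math

# Counts as listed in this message, in the order keep, keeps, keeping, kept.
# The learner counts sum to 393 (the message states 392); totals are
# computed from the listed counts rather than taken from the message.
forms = ["keep", "keeps", "keeping", "kept"]
learner = [348, 15, 9, 21]
native = [99, 14, 32, 25]


def chi_square_with_residuals(row_a, row_b):
    """Pearson chi-square for a 2 x k table, plus adjusted standardized
    residuals for the cells of row_a (row_b's residuals are the negatives).
    Hypothetical helper, written for this illustration only."""
    total_a, total_b = sum(row_a), sum(row_b)
    grand = total_a + total_b
    chi2 = 0.0
    residuals = []
    for a, b in zip(row_a, row_b):
        col = a + b
        # Expected counts under the null hypothesis of identical distributions.
        exp_a = total_a * col / grand
        exp_b = total_b * col / grand
        chi2 += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b
        # Adjusted standardized residual for the row_a cell.
        variance = exp_a * (1 - total_a / grand) * (1 - col / grand)
        residuals.append((a - exp_a) / math.sqrt(variance))
    return chi2, residuals


chi2, residuals = chi_square_with_residuals(learner, native)
print(f"chi-square = {chi2:.2f} (df = {len(forms) - 1})")
for form, r in zip(forms, residuals):
    print(f"{form:8s} adjusted residual (learner) = {r:+.2f}")
```

The overall statistic speaks to the table as a whole, while the residuals give one way of asking which individual forms contribute most to it.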

My other query: if I set aside the problem I just raised and try to detect
differences between the two corpora as a whole, what is the best statistical
method to use? Oakes pointed out a weakness of the chi-square test in
Statistics in Corpus Linguistics:

The Chi-square test is used for the comparison of frequency data. Kilgarriff
has shown that this test should be modified when working with corpus data,
since the null hypothesis is always rejected when working with
high-frequency words.

I wonder whether there is another test which could help with corpora
comparison.
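
As an illustration only: one commonly cited alternative in corpus work is the log-likelihood ratio statistic (G2). The sketch below applies it to the KEEP counts as listed earlier in this message (the learner counts as listed sum to 393, not the stated 392). The function name is hypothetical, and whether G2 actually avoids the high-frequency problem described in the quotation is an open question here, not something this sketch demonstrates.

```python
import math

# Counts as listed in this message: keep, keeps, keeping, kept.
learner = [348, 15, 9, 21]
native = [99, 14, 32, 25]


def log_likelihood_g2(row_a, row_b):
    """G2 = 2 * sum(O * ln(O / E)) over all cells of a 2 x k table.
    Hypothetical helper; cells with an observed count of 0 contribute 0."""
    total_a, total_b = sum(row_a), sum(row_b)
    grand = total_a + total_b
    g2 = 0.0
    for a, b in zip(row_a, row_b):
        col = a + b
        # Pair each observed count with its expected count under the null.
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * col / grand
            if observed > 0:
                g2 += observed * math.log(observed / expected)
    return 2 * g2


g2 = log_likelihood_g2(learner, native)
print(f"G2 = {g2:.2f} (df = {len(learner) - 1})")
```

Like chi-square, G2 is referred to the chi-square distribution with the same degrees of freedom, so it answers the same "as a whole" question by a different route.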

With thanks

Guo Xiaotian



