[Corpora-List] POS-tagging for spoken English and learner English

Fri Jul 22 08:51:05 UTC 2005

Hi, Adam and colleagues

I agree with Paul that "For learner data … POS tagging accuracy
depends on how advanced the learners are".

I have tried to have a native speaker corpus, LOCNESS, and a learner
corpus, COLEC as I call it, POS-tagged. It worked perfectly well with
LOCNESS. Unfortunately, I was let down by the inaccuracy of the
tagging of COLEC, owing to the special features of learners' errors. I
am not a computer person, but I speculate that when a tagging system
is devised, it is based on the syntax rules most native speakers
abide by. However, non-native speakers, especially those at an
intermediate level or below, do not produce language the way native
speakers do. You can hardly imagine how messy learner English can be.
That causes a huge problem for the POS tagging of a learner corpus,
and could very likely defeat the whole tagging system. Granger
discussed this point in her article in

Granger S., Hung J. and Petch-Tyson S. (eds) 2002. Computer Corpora,
Second Language Acquisition and Foreign Language Teaching. Amsterdam:
John Benjamins Publishing Company.

Of course, this does not mean there are no solutions.  If people try
hard enough, they may achieve a better accuracy rate. As far as I can
see (pardon me if I am talking nonsense), the tagging system should at
least not be based on native speaker syntax rules. Perhaps the tagging
system should be trained on adequate learner English data? But the
problem is that it is hard to find a set of syntax rules for learner
English. Anyway, I will keep all my fingers crossed for those who are
working on this part of tagging system design.
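
As a rough illustration of the retraining idea above, here is a
minimal sketch in Python. It is not how any particular tagger works:
it is a toy "most frequent tag per word" baseline, the sentences are
invented for the example (they are not taken from LOCNESS or COLEC),
and the function names are hypothetical. The tags are Penn Treebank
style. The sketch only shows that a tagger which has seen learner
usage of a word may tag learner text differently from one trained on
native text alone.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences, default_tag="NN"):
    """Toy baseline: give each word the tag it had most often in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag(words):
        # Unknown words fall back to default_tag (an assumption of this sketch).
        return [(w, lexicon.get(w.lower(), default_tag)) for w in words]

    return tag


def accuracy(tagger, gold_sentences):
    """Share of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for sentence in gold_sentences:
        predicted = tagger([w for w, _ in sentence])
        for (_, gold), (_, pred) in zip(sentence, predicted):
            correct += gold == pred
            total += 1
    return correct / total


# Invented native-style training data: "like" is mostly a preposition (IN).
native_train = [
    [("He", "PRP"), ("runs", "VBZ"), ("like", "IN"), ("a", "DT"), ("horse", "NN")],
    [("It", "PRP"), ("looks", "VBZ"), ("like", "IN"), ("rain", "NN")],
    [("They", "PRP"), ("like", "VBP"), ("tea", "NN")],
    [("She", "PRP"), ("drinks", "VBZ"), ("tea", "NN")],
]

# Invented learner-style data: uninflected "like" after a third-person subject,
# tagged by its form (VBP). How to tag such errors is itself a design decision.
learner_train = [
    [("She", "PRP"), ("like", "VBP"), ("tea", "NN")],
    [("He", "PRP"), ("like", "VBP"), ("rain", "NN")],
]
learner_test = [
    [("It", "PRP"), ("like", "VBP"), ("tea", "NN")],
]

native_tagger = train_unigram_tagger(native_train)
mixed_tagger = train_unigram_tagger(native_train + learner_train)

print("native-only training:", accuracy(native_tagger, learner_test))
print("native + learner training:", accuracy(mixed_tagger, learner_test))
```

On this toy data the native-only tagger labels the learner's "like" as
a preposition, while the tagger that also saw learner sentences labels
it as a verb. Real learner corpora are far messier, so this says
nothing about how much retraining would help in practice.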

All the best

Xiaotian Guo
PhD Candidate
The Department of English
The University of Birmingham