[Corpora-List] POS-tagging for spoken English and learner English

Rayson, Paul rayson at exchange.lancs.ac.uk
Thu Jul 21 16:46:05 UTC 2005


Adam,

Folks in UCREL at Lancaster and elsewhere have some experience of running CLAWS over corpora such as the spoken part of the BNC, MICASE, ICLE, and historical corpora (Nameless Shakespeare). My general impression is that the statistical HMM component of the tagger provides the robustness you need for these kinds of task, but it needs to be accompanied by tweaks to the other components, such as the 'idiom' lists and the tokenisation.
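As a rough illustration of what that statistical component does, here is a toy bigram HMM tagger with Viterbi decoding. This is a sketch, not CLAWS itself: the tagset, training data, and add-one smoothing are invented for the example.

```python
from collections import defaultdict

def train_hmm(tagged_sents):
    """Count tag-to-tag transitions and tag-to-word emissions.

    tagged_sents: list of [(word, tag), ...] sequences.
    Probabilities are computed with add-one smoothing at decode time.
    """
    trans = defaultdict(lambda: defaultdict(int))   # trans[prev_tag][tag]
    emit = defaultdict(lambda: defaultdict(int))    # emit[tag][word]
    for sent in tagged_sents:
        prev = "<s>"
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word.lower()] += 1
            prev = tag
    return trans, emit, sorted(emit)

def viterbi(words, trans, emit, tags):
    """Most likely tag sequence under the bigram HMM (add-one smoothed)."""
    def tp(prev, tag):
        row = trans[prev]
        return (row.get(tag, 0) + 1) / (sum(row.values()) + len(tags))

    def ep(tag, word):
        row = emit[tag]
        return (row.get(word.lower(), 0) + 1) / (sum(row.values()) + len(row) + 1)

    # best[tag] = (score, path); plain probabilities are fine at toy scale
    best = {t: (tp("<s>", t) * ep(t, words[0]), [t]) for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            score, path = max((best[p][0] * tp(p, t), best[p][1]) for p in tags)
            new[t] = (score * ep(t, w), path + [t])
        best = new
    return max(best.values())[1]

train = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]
trans, emit, tags = train_hmm(train)
print(viterbi(["the", "cat", "barks"], trans, emit, tags))
# -> ['DET', 'NOUN', 'VERB']
```

Retraining the transition probabilities on spoken data, as in point 1 below, amounts to re-estimating the `trans` counts from spoken material.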

Here's some more detail:

1. In the BNC project, the CLAWS transition probabilities were retrained on spoken data. There were also lexicon additions and special treatment of contractions, truncated words and repetitions, all closely tied to the transcription and encoding formats of the BNC spoken corpus. For more detail, see:

Garside, R. (1995) Grammatical tagging of the spoken part of the British National Corpus: a progress report. In Leech, G., Myers, G. and Thomas, J. (eds) (1995), Spoken English on Computer: Transcription, Mark-up and Application. pp.161-7.

Garside, R., and Smith, N. (1997) A hybrid grammatical tagger: CLAWS4, in Garside, R., Leech, G., and McEnery, A. (eds.) Corpus Annotation: Linguistic Information from Computer Text Corpora. Longman, London, pp. 102-121. 

Also see Nicholas Smith and Geoff Leech's manual for BNC version 2, which includes an error analysis comparing written and spoken data:

http://www.comp.lancs.ac.uk/ucrel/bnc2/bnc2error.htm
http://www.comp.lancs.ac.uk/ucrel/claws/
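The kinds of pre-tagging tweak mentioned in point 1 can be sketched as a preprocessing pass over the token stream. The token conventions below (contraction splits, a trailing hyphen for truncated words, flag labels) are purely illustrative, not the BNC's actual encoding scheme.

```python
import re

def preprocess_spoken(tokens):
    """Normalise a spoken-transcription token stream before tagging.

    Illustrative conventions (not the BNC's real encoding):
      - split enclitic contractions so each part can receive its own tag
      - flag truncated words (marked here with a trailing hyphen) so the
        tagger's lexicon lookup can treat them specially
      - collapse immediate repetitions ("the the the" -> one "the")
    Returns a list of (token, flag) pairs.
    """
    CONTRACTIONS = {           # tiny illustrative table
        "can't": ["ca", "n't"],
        "won't": ["wo", "n't"],
        "i'm": ["i", "'m"],
        "it's": ["it", "'s"],
    }
    out, prev = [], None
    for tok in tokens:
        low = tok.lower()
        if low == prev:                      # immediate repetition
            continue
        prev = low
        if low in CONTRACTIONS:
            out.extend((part, "SPLIT") for part in CONTRACTIONS[low])
        elif re.fullmatch(r"\w+-", tok):     # truncated word, e.g. "wor-"
            out.append((tok, "TRUNC"))
        else:
            out.append((tok, ""))
    return out

print(preprocess_spoken(["I'm", "I'm", "wor-", "working"]))
# -> [('i', 'SPLIT'), ("'m", 'SPLIT'), ('wor-', 'TRUNC'), ('working', '')]
```

A real pipeline would of course drive this from the corpus markup rather than from surface patterns.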

2. I don't have figures for MICASE, which we tagged with CLAWS, or for an ICLE sub-corpus, but we came away with the same general impression as above: the probability matrix provides robustness in these types of text, which you might otherwise expect to cause problems for automatic POS annotation. For learner data, of course, POS tagging accuracy depends on how advanced the learners are. You could have a look at

Bertus van Rooy and Lande Schäfer: An evaluation of three POS taggers for the tagging of the Tswana Learner English Corpus

which compares TOSCA-ICLE, the Brill tagger, and CLAWS on their data. It was presented at the learner corpus workshop at Corpus Linguistics 2003. The abstract is at
http://tonolab.meikai.ac.jp/~tono/cl2003/lcdda/abstracts/rooy.html
and the full paper is in the CL2003 proceedings.

3. In collaboration with Martin Mueller at Northwestern, we've recently been applying CLAWS to the Nameless Shakespeare corpus and looking at error rates and problems. There are other things that upset CLAWS (and would most likely do the same to other POS taggers), such as different capitalisation practices and variant spellings. Our approach has been to pre-process these as much as possible, retaining the original variants but fooling CLAWS, if you like, into tagging a version with modern equivalents. See:

Rayson, P., Archer, D. and Smith, N. (2005) VARD versus Word: A comparison of the UCREL variant detector and modern spell checkers on English historical corpora. In Proceedings of Corpus Linguistics 2005.

Our experience with Nameless Shakespeare was that CLAWS' current statistical language model copes pretty well with data from that period, but we expect that the probability matrix will need to be retrained if we attempt to tag data from much earlier than 1550-1600.
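The "fool the tagger with modern equivalents" approach in point 3 can be sketched as: map variants to modern forms, tag the modernised text, then project the tags back onto the original spellings. The variant table and the stand-in tagger below are invented for the example; VARD itself uses a much richer variant-detection step, and CLAWS would replace the toy tagger.

```python
# Tiny illustrative variant table (real variant detection is far richer).
VARIANTS = {"loue": "love", "doth": "does", "haue": "have"}

def tag_with_variants(tokens, tagger):
    """Tag a modernised copy of the text, keep the original spellings."""
    modern = [VARIANTS.get(t.lower(), t) for t in tokens]
    tags = tagger(modern)              # the tagger only sees modern forms
    return list(zip(tokens, tags))     # ...but the originals are retained

def toy_tagger(tokens):
    """Stand-in for a real modern-English tagger such as CLAWS."""
    lex = {"i": "PRON", "love": "VERB", "have": "VERB", "thee": "PRON"}
    return [lex.get(t.lower(), "UNK") for t in tokens]

print(tag_with_variants(["I", "loue", "thee"], toy_tagger))
# -> [('I', 'PRON'), ('loue', 'VERB'), ('thee', 'PRON')]
```

The point of the design is that the tagger's lexicon and probabilities never have to know about the historical spellings at all.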

Regards,
Paul.

Dr. Paul Rayson
Director of UCREL (University Centre for Computer Corpus Research on Language)
Computing Department, Infolab21, South Drive, Lancaster University, Lancaster, LA1 4WA, UK.
Web: http://www.comp.lancs.ac.uk/computing/users/paul/
New telephone number: +44 1524 510357  Fax: +44 1524 510492


