[Corpora-List] summary: learner corpora

Barbara Schiftner barbara_schiftner at gmx.net
Tue May 29 23:58:05 UTC 2007


Dear all,


Here's a summary of the information I received in response to my
inquiry about learner corpora of English and German. I have also included
links to information on other learner corpora I have found so far.


Many thanks to those of you who replied to my message!


Best,
Barbara




Corpora of Learner English


CLC (Cambridge Learner Corpus)

http://www.cambridge.org/elt/corpus/learner_corpus.htm

CLEC (Chinese Learner English Corpus)

http://langbank.engl.polyu.edu.hk/corpus/clec.html

HKUST (Hong Kong University of Science and Technology)

- currently around 30 million words
- texts written by university students (mostly Cantonese speakers
studying Engineering, Science and Business)
- mostly untimed assignments from EFL courses (mostly 400-100 words),
plus about a million words of school-leaving exams
- around 200,000 words are POS-tagged with CLAWS
- the error taxonomy and tags used are loosely described in a revised
version of John Milton's PhD thesis; see
http://repository.ust.hk/dspace/handle/1783.1/1055
- the corpus was used in developing materials and syllabi, including
writing tools such as an interactive grammar guide, a learners' online
concordancer and a number of 'blended' (online + classroom) courses.
These were also informed by comparing the learner corpus with the
writing of students who took the Cambridge General Studies examination.

ICLE (International Corpus of Learner English)

http://cecl.fltr.ucl.ac.be/research.html

JEFLL (Japanese EFL Learner)

http://leo.meikai.ac.jp/~tono/

JPU (Janus Pannonius University Corpus)

- corpus blog: http://joeandco.blogspot.com (221 scripts are available
there and free to use; thematic search is available via the labels)
- Jozsef Horvath's PhD dissertation:
http://www.geocities.com/writing_site/thesis

LLC (Longman Learners’ Corpus)

http://www.longman-elt.com/dictionaries/corpus/learners.html

MELD (Montclair Electronic Language Database)

http://www.chss.montclair.edu/linguistics/MELD/

Polish Learner English Corpus

http://pelcra.ia.uni.lodz.pl/corpora_en.php

SILS (School of International Liberal Studies at Waseda University)

http://www.f.waseda.jp/vicky/learner/index.html

TeleNex Student Corpus

http://www.telenex.hku.hk/telec/smain/sintro/intro.htm

USE (Uppsala Student English Project)

http://www.engelska.uu.se/use.html




Corpora of Learner German


FALKO (fehlerannotiertes Lernerkorpus des Deutschen als Fremdsprache,  
HU Berlin)

Information on the corpus is available from:
http://www2.hu-berlin.de/korpling/projekte/falko/index.php
(There is also a web interface to query the corpus.)

LeKo (Lernerkorpus, HU Berlin)

Telecorp (Pennsylvania)

Corpus collected by Ursula Weinberger (Lancaster)

COMET Project

- corpus of Learner English and Learner German (as well as Italian
and Spanish), collected at the University of São Paulo, Brazil
- http://www.fflch.usp.br/dlm/comet/comaprend.html
- article on the corpus:
http://www.fflch.usp.br/dlm/comet/artigos/A%20multilingual%20learner%20corpus%20in%20Brazil.pdf





Corpora of Spoken Learner Language

ISLE

Approx. 20 minutes of speech (per speaker) from 23 German and 23 Italian
intermediate learners of English. Each speaker recorded sentences from
several blocks of differing types (reading simple sentences, using
minimal pairs, giving answers to multiple choice questions). The prompts
were of varying perplexities.
About 2/3 of the data for each speaker was annotated by one of a team of
linguists. The files were corrected first at the word level, and an
automatic recognizer was then used to produce phone-level annotations.
The annotator then re-annotated each sentence to mark phone and stress
errors (e.g., substitutions, insertions, or deletions). Corpus details:
46 speakers (23 German and 23 Italian); 11,484 utterances; 1.92
gigabytes of WAV files (4 CDs); 17 hours, 54 minutes, and 44 seconds
of speech data. For more details, see:

Menzel, W; Atwell, E; Bonaventura, P; Herron, D; Howarth, P; Morton, R;
Souter, C. The ISLE Corpus of non-native spoken English. In Proc.
LREC2000, vol. 2, pp. 957-964. European Language Resources
Association. 2000. http://www.comp.leeds.ac.uk/eric/menzel00lrec.pdf

Atwell, Eric; Howarth, Peter; Souter, Clive. The ISLE corpus: Italian
and German spoken learner's English. ICAME Journal, vol. 27, pp. 5-18.
2003. http://www.comp.leeds.ac.uk/eric/atwell03icamej.pdf
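
As a quick sanity check on the ISLE figures quoted above, the short
Python sketch below derives per-speaker and per-utterance averages from
the stated totals. The variable names are my own, and the averages
assume the speech is spread evenly across speakers and utterances,
which the description does not claim.

```python
# Totals as quoted in the ISLE corpus details above.
SPEAKERS = 46
UTTERANCES = 11484
HOURS, MINUTES, SECONDS = 17, 54, 44

# Convert the stated duration to seconds.
total_seconds = HOURS * 3600 + MINUTES * 60 + SECONDS

# Simple averages (assumption: even distribution across speakers).
minutes_per_speaker = total_seconds / SPEAKERS / 60
seconds_per_utterance = total_seconds / UTTERANCES
utterances_per_speaker = UTTERANCES / SPEAKERS

print(f"total speech:           {total_seconds} s")
print(f"minutes per speaker:    {minutes_per_speaker:.1f}")
print(f"seconds per utterance:  {seconds_per_utterance:.1f}")
print(f"utterances per speaker: {utterances_per_speaker:.0f}")
```

The per-speaker average comes out at a little over 23 minutes, which is
in line with the "approx. 20 minutes of speech (per speaker)" given in
the description.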




______________________________
Barbara Schiftner

Fachdidaktisches Zentrum
Institut fuer Anglistik und Amerikanistik
Universitaet Wien
Spitalgasse 2, Hof 8
A-1090 Wien, AUSTRIA

Tel.: +43-1-4277-424-53
E-Mail: barbara.schiftner at univie.ac.at


