[Corpora-List] summary: learner corpora
Barbara Schiftner
barbara_schiftner at gmx.net
Tue May 29 23:58:05 UTC 2007
Dear all,
Here's a summary of the information I received in response to my
inquiry about learner corpora of English and German. I have also included
links to information on other learner corpora I have found so far.
Many thanks to those of you who replied to my message!
Best,
Barbara
Corpora of Learner English
CLC (Cambridge Learner Corpus)
http://www.cambridge.org/elt/corpus/learner_corpus.htm
CLEC (Chinese Learner English Corpus)
http://langbank.engl.polyu.edu.hk/corpus/clec.html
HKUST (Hong Kong University of Science and Technology)
- currently around 30 million words.
- texts written by university students (mostly Cantonese speakers
studying Engineering, Science and Business)
- mostly untimed assignments from EFL courses (400-100 words mostly),
and about a million words of school-leaving exams
- around 200,000 words are POS-tagged with CLAWS
- the error taxonomy and tags used are loosely described in a revised
version of John Milton's PhD thesis; see
http://repository.ust.hk/dspace/handle/1783.1/1055
- the corpus was used in developing materials and syllabi, including
writing tools such as an interactive grammar guide, a learners' online
concordancer and a number of 'blended' (online + classroom) courses.
These were also informed by comparing the learner corpus to the
writing of students who took the Cambridge General Studies examination.
ICLE (International Corpus of Learner English)
http://cecl.fltr.ucl.ac.be/research.html
JEFLL (Japanese EFL Learner)
http://leo.meikai.ac.jp/~tono/
JPU (Janus Pannonius University Corpus)
- corpus blog: http://joeandco.blogspot.com (221 scripts are available
there and free to use; thematic search is available via the labels.)
- Jozsef Horvath's PhD dissertation:
http://www.geocities.com/writing_site/thesis.
LLC (Longman Learners’ Corpus)
http://www.longman-elt.com/dictionaries/corpus/learners.html
MELD (Montclair Electronic Language Database)
http://www.chss.montclair.edu/linguistics/MELD/
Polish Learner English Corpus
http://pelcra.ia.uni.lodz.pl/corpora_en.php
SILS (School of International Liberal Studies at Waseda University)
http://www.f.waseda.jp/vicky/learner/index.html
TeleNex Student Corpus
http://www.telenex.hku.hk/telec/smain/sintro/intro.htm
USE (Uppsala Student English Project)
http://www.engelska.uu.se/use.html
Corpora of Learner German
FALKO (fehlerannotiertes Lernerkorpus des Deutschen als Fremdsprache,
HU Berlin)
Information on the corpus is available from:
http://www2.hu-berlin.de/korpling/projekte/falko/index.php
(There is also a web interface to query the corpus.)
LeKo (Lernerkorpus, HU Berlin)
Telecorp (Pennsylvania)
Corpus collected by Ursula Weinberger (Lancaster)
COMET Project
- corpus of Learner English and Learner German (as well as Italian
and Spanish), collected at the University of São Paulo, Brazil
- http://www.fflch.usp.br/dlm/comet/comaprend.html
- article on the corpus:
http://www.fflch.usp.br/dlm/comet/artigos/A%20multilingual%20learner%20corpus%20in%20Brazil.pdf
Corpora of Spoken Learner Language
ISLE
Approx. 20 minutes of speech (per speaker) from 23 German and 23 Italian
intermediate learners of English. Each speaker recorded sentences from
several blocks of differing types (reading simple sentences, using
minimal pairs, giving answers to multiple choice questions). The prompts
were of varying perplexities.
About 2/3 of the data for each speaker was annotated by one of a team of
linguists. The files were corrected first at the word level, and an
automatic recognizer was then used to produce phone-level annotations.
The annotator then re-annotated each sentence to mark phone and stress
errors (e.g., substitutions, insertions, or deletions). Corpus details:
46 speakers (23 German and 23 Italian); 11484 utterances; 1.92
gigabytes of WAV files (4 CDs); 17 hours, 54 minutes, and 44 seconds
of speech data. For more details, see:
Menzel, W; Atwell, E; Bonaventura, P; Herron, D; Howarth, P; Morton, R;
Souter, C. The ISLE Corpus of non-native spoken English. In Proc.
LREC 2000, vol. 2, pp. 957-964, European Language Resources
Association. 2000. http://www.comp.leeds.ac.uk/eric/menzel00lrec.pdf
Atwell, Eric; Howarth, Peter; Souter, Clive. The ISLE corpus: Italian
and German spoken learner's English. ICAME Journal, vol. 27, pp. 5-18.
2003. http://www.comp.leeds.ac.uk/eric/atwell03icamej.pdf
______________________________
Barbara Schiftner
Fachdidaktisches Zentrum
Institut fuer Anglistik und Amerikanistik
Universitaet Wien
Spitalgasse 2, Hof 8
A-1090 Wien, AUSTRIA
Tel.: +43-1-4277-424-53
E-Mail: barbara.schiftner at univie.ac.at