24.2935, FYI: EF Cambridge Open Language Database
linguist at linguistlist.org
Thu Jul 18 18:53:17 UTC 2013
LINGUIST List: Vol-24-2935. Thu Jul 18 2013. ISSN: 1069 - 4875.
Subject: 24.2935, FYI: EF Cambridge Open Language Database
Moderator: Damir Cavar, Eastern Michigan U <damir at linguistlist.org>
Reviews: Veronika Drake, U of Wisconsin Madison
Monica Macaulay, U of Wisconsin Madison
Rajiv Rao, U of Wisconsin Madison
Joseph Salmons, U of Wisconsin Madison
Mateja Schuck, U of Wisconsin Madison
Anja Wanner, U of Wisconsin Madison
<reviews at linguistlist.org>
Homepage: http://linguistlist.org
Editor for this issue: Rebekah McClure <rebekah at linguistlist.org>
================================================================
Date: Thu, 18 Jul 2013 14:52:49
From: Dora Alexopoulou [ta259 at cam.ac.uk]
Subject: EF Cambridge Open Language Database
New Corpus of L2 English writings: EF Cambridge Open Language Database (EFCamDat)
http://corpus.mml.cam.ac.uk
We are pleased to announce the release of a new resource of L2 English writings, the EF Cambridge Open Language Database (EFCamDat). EFCamDat was developed at the Dept. of Theoretical and Applied Linguistics at the University of Cambridge, in collaboration with EF Education First, an international educational organisation. EFCamDat contains writings submitted to Englishtown, EF’s online school, which is accessed daily by thousands of learners worldwide. The database currently contains 412K scripts from 76K learners, totalling 32 million words. As new data come in, we expect to reach 100 million words by the end of 2014 and to be able to follow the longitudinal development of even more students.
Scripts are organised according to EF's proficiency levels and the topic of the writing activity, and contain teachers' corrections and scores. In addition, scripts have been annotated automatically with Penn Treebank part-of-speech tags (Marcus et al., 1993) and grammatical relations according to the Stanford Dependency scheme (De Marneffe and Manning, 2008). Details of the automatic annotation, and an evaluation of how these tools perform on learner data, are presented in Geertzen et al., 2013.
EFCamDat is freely available to the academic community, subject to an end-user agreement protecting copyright. It can be accessed through a web-based interface at:
http://corpus.mml.cam.ac.uk/efcamdat
(please click on Frequently Asked Questions to download relevant documentation).
The interface supports selection of scripts by proficiency level and by learner nationality, search for parts of speech and grammatical relations, and export of raw text as well as tagged scripts.
We gratefully acknowledge support from the Isaac Newton Trust, Trinity College, Cambridge, and EF Education First.
Dora Alexopoulou, Rachel Baker, Jeroen Geertzen, Anna Korhonen
References
De Marneffe, M. C. and Manning, C. D. (2008). The Stanford typed dependencies representation. In Coling 2008: Proc. of the Workshop on Cross-Framework and Cross-Domain Parser Evaluation, pages 1–8.
Education First (2012). Englishtown. http://www.englishtown.com/.
Geertzen, J., Alexopoulou, T., and Korhonen, A. (2012). Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCAMDAT). In Proceedings of the 31st Second Language Research Forum (SLRF), Carnegie Mellon. Cascadilla Press.
Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
Linguistic Field(s): Applied Linguistics
Text/Corpus Linguistics
Subject Language(s): English (eng)