24.2935, FYI: EF Cambridge Open Language Database

linguist at linguistlist.org
Thu Jul 18 18:53:17 UTC 2013


LINGUIST List: Vol-24-2935. Thu Jul 18 2013. ISSN: 1069 - 4875.

Subject: 24.2935, FYI: EF Cambridge Open Language Database

Moderator: Damir Cavar, Eastern Michigan U <damir at linguistlist.org>

Reviews: Veronika Drake, U of Wisconsin Madison
Monica Macaulay, U of Wisconsin Madison
Rajiv Rao, U of Wisconsin Madison
Joseph Salmons, U of Wisconsin Madison
Mateja Schuck, U of Wisconsin Madison
Anja Wanner, U of Wisconsin Madison
       <reviews at linguistlist.org>

Homepage: http://linguistlist.org

Editor for this issue: Rebekah McClure <rebekah at linguistlist.org>
================================================================  


Date: Thu, 18 Jul 2013 14:52:49
From: Dora Alexopoulou [ta259 at cam.ac.uk]
Subject: EF Cambridge Open Language Database

New Corpus of L2 English writings: EF Cambridge Open Language Database (EFCamDat) 
http://corpus.mml.cam.ac.uk

We are pleased to announce the release of a new resource of L2 English writings, the EF Cambridge Open Language Database (EFCamDat). EFCamDat was developed at the Dept. of Theoretical and Applied Linguistics at the University of Cambridge, in collaboration with EF Education First, an international educational organisation. EFCamDat contains writings submitted to Englishtown, EF’s online school, which is accessed daily by thousands of learners worldwide. The database currently contains 412K scripts from 76K learners, totalling 32 million words. As new data come in, we expect to reach 100 million words by the end of 2014 and to be able to follow the longitudinal development of even more students.
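
To give a rough sense of scale, the figures above can be turned into per-learner and per-script averages. The short Python sketch below is only illustrative: it uses the rounded counts quoted in this announcement (412K, 76K, 32 million), and the variable names are hypothetical, not part of EFCamDat or its interface.

```python
# Rounded figures as quoted in the announcement; results are approximate.
scripts = 412_000      # "412K scripts"
learners = 76_000      # "76K learners"
words = 32_000_000     # "32 million words"

# Simple averages derived from the three totals.
scripts_per_learner = scripts / learners
words_per_script = words / scripts
words_per_learner = words / learners

print(f"scripts per learner: {scripts_per_learner:.1f}")
print(f"words per script:    {words_per_script:.1f}")
print(f"words per learner:   {words_per_learner:.0f}")
```

With these rounded inputs the averages come out at roughly 5.4 scripts per learner and about 78 words per script; the true values depend on the exact counts in the database.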

Scripts are organised according to EF's proficiency levels and the topic of the writing activity, and contain teachers' corrections and scores. In addition, scripts have been annotated automatically with Penn Treebank part-of-speech tags (Marcus et al., 1993) and grammatical relations according to the Stanford Dependency scheme (De Marneffe and Manning, 2008). Details of the automatic annotation, and an evaluation of how these tools perform on learner data, are presented in Geertzen et al., 2013.

EFCamDat is freely available to the academic community, subject to an end-user agreement protecting copyright. It can be accessed through a web-based interface at:

http://corpus.mml.cam.ac.uk/efcamdat

(please click on Frequently Asked Questions to download relevant documentation).

The interface supports selection of scripts by proficiency level and by learner nationality, search for parts of speech and grammatical relations, and export of raw text as well as tagged scripts.

We gratefully acknowledge support by the Isaac Newton Trust, Trinity College, Cambridge, and EF Education First.

Dora Alexopoulou, Rachel Baker, Jeroen Geertzen, Anna Korhonen


References 

De Marneffe, M. C. and Manning, C. D. (2008). The Stanford typed dependencies representation. In Coling 2008: Proc. of the Workshop on Cross-Framework and Cross-Domain Parser Evaluation, pages 1–8.

Education First (2012). Englishtown. http://www.englishtown.com/. 

Geertzen, J., Alexopoulou, T., and Korhonen, A. (2012). Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCAMDAT). In Proceedings of the 31st Second Language Research Forum (SLRF), Carnegie Mellon. Cascadilla Press.

Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.



Linguistic Field(s): Applied Linguistics
                     Text/Corpus Linguistics

Subject Language(s): English (eng)