
LINGUIST List: Vol-24-99. Wed Jan 09 2013. ISSN: 1069 - 4875.

Subject: 24.99, Confs: Computational Linguistics/USA

Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews: Veronika Drake, U of Wisconsin Madison
Monica Macaulay, U of Wisconsin Madison
Rajiv Rao, U of Wisconsin Madison
Joseph Salmons, U of Wisconsin Madison
Anja Wanner, U of Wisconsin Madison
       <reviews at linguistlist.org>

Homepage: http://linguistlist.org

Do you want to donate to LINGUIST without spending an extra penny? Bookmark
the Amazon link for your country below; then use it whenever you buy from
Amazon:

USA: http://www.amazon.com/?_encoding=UTF8&tag=linguistlist-20
Britain: http://www.amazon.co.uk/?_encoding=UTF8&tag=linguistlist-21
Germany: http://www.amazon.de/?_encoding=UTF8&tag=linguistlistd-21
Japan: http://www.amazon.co.jp/?_encoding=UTF8&tag=linguistlist-22
Canada: http://www.amazon.ca/?_encoding=UTF8&tag=linguistlistc-20
France: http://www.amazon.fr/?_encoding=UTF8&tag=linguistlistf-21

For more information on the LINGUIST Amazon store please visit our
FAQ at http://linguistlist.org/amazon-faq.cfm.

Editor for this issue: Xiyan Wang <xiyan at linguistlist.org>

Date: Wed, 09 Jan 2013 11:42:01
From: Joel Tetreault [nlisharedtask2013 at gmail.com]
Subject: 1st Shared Task on Native Language Identification

1st Shared Task on Native Language Identification 
Short Title: NLI2013 

Date: 13-Jun-2013 - 14-Jun-2013 
Location: Atlanta, GA, USA 
Contact: Joel Tetreault 
Contact Email: nlisharedtask2013 at gmail.com 
Meeting URL: https://sites.google.com/site/nlisharedtask2013/ 

Linguistic Field(s): Computational Linguistics 

Meeting Description: 

We are excited to organize the first shared task in Native Language Identification (NLI), the task of identifying the native language (L1) of a writer based solely on a sample of their writing. The task is framed as a classification problem where the set of L1s is known a priori. Most work has focused on identifying the native language of writers learning English as a second language. This problem has been growing in popularity and has motivated several ACL, NAACL and EMNLP papers, as well as master's and doctoral theses.
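To make the classification framing concrete, here is a minimal toy sketch of one common NLI approach: building a character n-gram frequency profile per L1 and assigning a new essay to the closest profile. The labels and training strings below are invented for illustration; real systems train on thousands of essays per L1 and use far richer features.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-grams, a common surface feature for NLI."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train_profiles(samples):
    """Build one relative-frequency n-gram profile per L1 label."""
    profiles = {}
    for label, texts in samples.items():
        counts = Counter()
        for t in texts:
            counts.update(char_ngrams(t))
        total = sum(counts.values())
        profiles[label] = {g: c / total for g, c in counts.items()}
    return profiles

def classify(text, profiles):
    """Score each L1 by the summed relative frequency of the essay's n-grams."""
    grams = char_ngrams(text)
    scores = {label: sum(p.get(g, 0.0) for g in grams)
              for label, p in profiles.items()}
    return max(scores, key=scores.get)

# Invented toy data -- stand-ins for per-L1 essay collections.
train = {
    "L1_A": ["aaa bbb aaa", "aaa aaa ccc"],
    "L1_B": ["xxx yyy xxx", "xxx zzz yyy"],
}
profiles = train_profiles(train)
print(classify("aaa bbb", profiles))  # prints "L1_A"
```

Because the set of L1s is known in advance, the problem reduces to standard multi-class supervised classification, which is what makes a shared evaluation on common data feasible.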

Native Language Identification can be useful for a number of applications. First, it can be used in educational settings to provide more targeted feedback to language learners about their errors. It is well known that speakers of different languages make different kinds of errors when learning a language. A writing tutor system that can detect the native language of the learner will be able to tailor feedback about an error and contrast it with common properties of the learner's L1. Second, native language is often used as a feature in authorship profiling, which is frequently employed in forensic linguistics.

The goal of this task is to provide a space to evaluate different techniques and approaches to Native Language Identification. To date, it has been difficult to compare approaches due to issues with training and testing data and a lack of consistency in evaluation standards. In this shared task, we provide a new data set as well as a framework in which different NLI systems can finally be compared. The shared task will be co-located with the 8th Workshop on Innovative Use of NLP for Building Educational Applications (BEA8) on June 13 or 14 in Atlanta, USA.

Educational Testing Service (ETS) is making 11,000 English essays from the Test of English as a Foreign Language (TOEFL) publicly available through the LDC, with the aim of creating a larger and more reliable data set on which researchers can conduct Native Language Identification experiments. This set, henceforth TOEFL11, comprises 11 L1s with 1,000 essays per L1. The 11 native languages covered by the corpus are: Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish. Furthermore, each essay in the TOEFL11 is labeled with an English language proficiency level (high, medium, or low) based on the judgments of human assessment specialists. The essays are usually 300 to 400 words long. 90% of this set will be released as training data and the remaining 10% will be sequestered as test data.
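The arithmetic of the stated partition is worth spelling out. Assuming the 90/10 split is applied uniformly across L1s (the announcement gives only the overall percentages, so per-L1 stratification is our assumption), the sizes work out as follows:

```python
# Corpus figures stated in the announcement: 11 L1s, 1,000 essays each.
NUM_L1S = 11
ESSAYS_PER_L1 = 1000

# 90% training / 10% test, assumed stratified per L1.
train_per_l1 = int(ESSAYS_PER_L1 * 0.9)      # 900 training essays per L1
test_per_l1 = ESSAYS_PER_L1 - train_per_l1   # 100 test essays per L1

print(NUM_L1S * train_per_l1, NUM_L1S * test_per_l1)  # prints "9900 1100"
```

A stratified split keeps the test set balanced across the 11 L1s, so overall accuracy on it is directly comparable across systems.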


The shared task will have three sub-tasks:

- Closed-Training: The first and main task will be 11-way classification using only the TOEFL11 for training.
- Open-Training-1: The second task will allow the use of any amount or type of training data, excluding the TOEFL11.
- Open-Training-2: The third task will allow the use of any amount or type of training data.

The same test data will be used for all sub-tasks.


If you would like to participate in the NLI Shared Task, you need to formally register in order to obtain the training and test data. To register, please send the following information to nlisharedtask2013 at gmail.com:

- Name of Institution or other label appropriate for your team
- Name of contact person for your team
- Email address of contact person for your team


Important Dates:

January 14: Training data release
March 11: Test data release
March 18: Submissions due
March 25: Results announcement
April 8: Papers due
April 10: Revision requests sent
April 12: Camera ready version due
June 13 or 14: NLI Shared Task presentations @ BEA8 Workshop


Organizers:

Joel Tetreault, Nuance Communications, USA
Aoife Cahill, Educational Testing Service, USA
Daniel Blanchard, Educational Testing Service, USA

Contact email: nlisharedtask2013 at gmail.com
