26.4037, FYI: AESOP-ILAS Corpora Release Announcement

Mon Sep 14 14:34:14 UTC 2015

LINGUIST List: Vol-26-4037. Mon Sep 14 2015. ISSN: 1069 - 4875.

Subject: 26.4037, FYI:  AESOP-ILAS Corpora Release Announcement

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry, Sara Couture)
Homepage: http://linguistlist.org

Editor for this issue: Ashley Parker <ashley at linguistlist.org>
================================================================

Date: Mon, 14 Sep 2015 10:33:36
From: Chiu-yu Tseng [tingyuyou at phslab.ling.sinica.edu.tw]
Subject: AESOP-ILAS Corpora Release Announcement

Database Name: AESOP-ILAS (Asian English Speech Corpus Project - Institute of Linguistics, Academia Sinica) Corpora

Database Brief: The AESOP-ILAS Corpora are the outcome of work by the Taiwan research team of the multinational consortium AESOP (Asian English Speech Corpus Project). Initiated by Professor Yoshinori Sagisaka of Waseda University in 2008, the consortium aims to collect common speech data for a better understanding of L2 English features, both those common to Asian English in general and those specific to each participating Asian country.

Participating research teams are committed to (1) recording a common set of core data designed by Dr. Chiu-yu Tseng of Academia Sinica and Dr. Tanya Visceglia of National Yangming University and (2) following the same recording protocols designed by Professor Helen Meng of the Chinese University of Hong Kong. The AESOP consortium consists of research teams from Japan, Hong Kong, Taiwan, China, Thailand, Indonesia and India. Participating members have met annually, under Convener Professor Sagisaka (2008-2013) and under Convener Professor Mariko Kondo of Waseda University since 2014, to update data collection and share research findings.

The AESOP Taiwan team is led by Dr. Chiu-yu Tseng, Distinguished Research Fellow and Director of ILAS (the Institute of Linguistics, Academia Sinica), and its data feature L2 English speech by native speakers of Taiwan Mandarin. The AESOP-ILAS research project was funded by the Chiang Ching-kuo Foundation for International Scholarly Exchange (DB002-D-08, 2009.7.1-2012.12.31). Rather than focusing on specific, individual phenomena, the project mainly aims to investigate a wide range of phonetic and prosodic features in Taiwan L2 English that bear communicative functions at the segmental, lexical, phrasal, and discourse levels. The intellectual property of the corpora belongs to Academia Sinica and is therefore subject to the regulations of the Department of Intellectual Property and Technology Transfer, Academia Sinica.

The AESOP-ILAS Corpora are divided into two parts: AESOP-ILAS 1, which contains the AESOP core data, and AESOP-ILAS 2, which contains speech data focusing on prosodic properties specific to research projects led by Dr. Chiu-yu Tseng, PI of the Phonetics Lab, ILAS.

The AESOP-ILAS Corpora total 13.9 GB and contain approximately 812 hours of sound files. AESOP-ILAS 1 is 8.58 GB (500 hours) and includes L1 English speech data by 12 native speakers of American English (6M, 6F) and L2 English speech data by 488 Taiwan Mandarin speakers (231M, 257F). The recording time for each speaker is approximately 1 hour. The L2 speakers’ years of English training range from 2 to 22 (average 10.5 years). The data content consists of 8 recorded tasks: 6 elicited read-speech tasks (including a reading of the passage The North Wind and the Sun), 1 fully aided computer-prompted dialogue task, and 1 partially aided picture description task.
AESOP-ILAS 2 is 5.32 GB (312 hours) and includes L1 English speech data by 10 American English speakers (5M, 5F) and L2 English speech data by 30 Taiwan Mandarin speakers (15M, 15F). The recording time is approximately 5.25 hours for each L1 speaker and 8.7 hours for each L2 speaker. The L2 speakers’ years of English training range from 7 to 30 (average 15.3 years). The data content consists of 5 recorded tasks: 4 elicited read-speech tasks (readings of approximately 5400 high-frequency words from the CMU Electronic Dictionary, elicited broad/narrow focus sentences designed by AESOP-CASS (Chinese Academy of Social Sciences), The Cinderella Passage, and one Taiwan Mandarin task) and 1 fully aided computer-prompted Discourse Completion Task (DCT) modified from the Waseda dataset.
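
For readers who want to check how the figures above fit together, the following is a minimal Python sketch, not part of the release itself. The numbers are copied from the announcement; the dictionary layout and the function names (totals, estimated_hours_part2) are illustrative assumptions only.

```python
# Figures copied from the announcement; structure and names are illustrative.
SUBCORPORA = {
    "AESOP-ILAS 1": {"size_gb": 8.58, "hours": 500,
                     "l1_speakers": 12, "l2_speakers": 488},
    "AESOP-ILAS 2": {"size_gb": 5.32, "hours": 312,
                     "l1_speakers": 10, "l2_speakers": 30},
}


def totals(subcorpora):
    """Return (total size in GB, total hours, total speakers)."""
    size = round(sum(p["size_gb"] for p in subcorpora.values()), 2)
    hours = sum(p["hours"] for p in subcorpora.values())
    speakers = sum(p["l1_speakers"] + p["l2_speakers"]
                   for p in subcorpora.values())
    return size, hours, speakers


def estimated_hours_part2(l1_hours_each=5.25, l2_hours_each=8.7):
    """Estimate AESOP-ILAS 2 hours from the stated per-speaker times."""
    part = SUBCORPORA["AESOP-ILAS 2"]
    return (part["l1_speakers"] * l1_hours_each
            + part["l2_speakers"] * l2_hours_each)


if __name__ == "__main__":
    print(totals(SUBCORPORA))
    print(estimated_hours_part2())
```

The per-speaker estimate for AESOP-ILAS 2 comes out slightly above the stated 312 hours, which is consistent with the per-speaker times being given as approximate.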

The AESOP-ILAS Corpora were released in April 2015 through ACLCLP (Association for Computational Linguistics and Chinese Language Processing) for non-commercial academic research use only. The Corpora should be useful for research and development in language teaching, language modeling, phonetic research, and applications to speech synthesis and recognition.

To apply for non-commercial use, please visit the ACLCLP AESOP-ILAS Corpora page (http://www.aclclp.org.tw/use_mat.php#aesop).

For commercial applications, please contact the Department of Intellectual Property and Technology Transfer, Academia Sinica (Website: http://otl.sinica.edu.tw/en/ ; Tel: +886-2-2787-2509).

Linguistic Field(s): Computational Linguistics
                     Discourse Analysis
                     Phonetics

Subject Language(s): Chinese, Mandarin (cmn)
                     English (eng)
