26.4036, FYI: Sinica COSPRO & Toolkit Release Announcement

Mon Sep 14 14:32:58 UTC 2015

LINGUIST List: Vol-26-4036. Mon Sep 14 2015. ISSN: 1069 - 4875.

Subject: 26.4036, FYI:  Sinica COSPRO & Toolkit Release Announcement

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry, Sara Couture)
Homepage: http://linguistlist.org

Editor for this issue: Ashley Parker <ashley at linguistlist.org>
================================================================

Date: Mon, 14 Sep 2015 10:32:29
From: Chiu-yu Tseng [tingyuyou at phslab.ling.sinica.edu.tw]
Subject: Sinica COSPRO & Toolkit Release Announcement

 Database Name: Sinica COSPRO & Toolkit

Database Brief: The Sinica COSPRO (Mandarin Continuous Speech Prosody Corpora) and Toolkit were designed, collected, and annotated by Dr. Chiu-yu Tseng and her research group at the Phonetics Lab, Institute of Linguistics, Academia Sinica, Taipei, Taiwan (1994-2005) to support phonetic research and the development of speech synthesis and recognition.

The corpora include 9 subsets consisting of both read and spontaneous speech data from a total of 114 native speakers of Mandarin (53 male, 61 female). The corpora total 10.5 GB and comprise approximately 132 hours of sound files. The reading texts were designed to cover various prosodic phenomena and include word sequences (1-4 words), sentences (declarative, exclamatory, interrogative), sentences consisting of random words (“Word Salad”), and paragraphs (85-996 syllables).

Of the total, 7.7 GB of the database has been manually annotated. This portion includes (1) wav files, (2) a transcription for each speaker (*.txt), (3) human-labeled segmental boundaries (*adjusted / *.syl), and (4) human-labeled prosodic boundaries (*.break). The remaining portion includes (1) wav files, (2) a transcription for each speaker (*.txt), and (3) segmental boundaries auto-labeled by the HTK toolkit (*phn).
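
As a rough illustration of the file inventory described above, the following Python sketch groups file names by the suffix patterns quoted in this announcement. It is only a sketch under stated assumptions: the function names (classify, summarize), the category labels, and the sample file names are hypothetical, and the actual naming and directory layout of the distributed corpus are not specified here.

```python
from collections import Counter

# Suffix patterns as quoted in the announcement. How files are actually
# named in the distributed corpus is an assumption of this sketch.
SUFFIX_KINDS = [
    (".wav", "audio"),
    (".txt", "transcription"),
    ("adjusted", "segmental_manual"),   # human-labeled segmental boundaries
    (".syl", "segmental_manual"),       # human-labeled segmental boundaries
    (".break", "prosodic_manual"),      # human-labeled prosodic boundaries
    ("phn", "segmental_auto"),          # HTK auto-labeled segmental boundaries
]


def classify(filename):
    """Return a category label for a file name based on its suffix."""
    lower = filename.lower()
    for suffix, kind in SUFFIX_KINDS:
        if lower.endswith(suffix):
            return kind
    return "other"


def summarize(filenames):
    """Count how many file names fall into each category."""
    return Counter(classify(name) for name in filenames)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    sample = ["s01_p03.wav", "s01_p03.txt", "s01_p03.syl",
              "s01_p03.break", "s02_p01.wav", "s02_p01.phn"]
    for kind, count in sorted(summarize(sample).items()):
        print(kind, count)
```

A summary of this kind could help a licensee check which recordings carry manual annotation and which carry only HTK auto-labeling before starting an analysis.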

Except for the HTK force-aligned segments, all annotation was perception-based and manually tagged by trained transcribers. Both intra- and inter-transcriber consistency are around 90%. As a result, the tagging provides perceived speech units that are independent of syntactic structures and semantic relationships.

The COSPRO Toolkit is a Windows-based, user-friendly speech analysis software package and interface. It integrates commonly accessible speech analysis software, such as Adobe Audition, Praat, and Speech Viewer, into one common platform, and provides three major functions: (1) performing acoustic analysis, (2) labeling continuous fluent speech, and (3) re-synthesizing speech signals.

The intellectual property of the corpora belongs to Academia Sinica and is therefore subject to the specifications of the Department of Intellectual Property and Technology Transfer, Academia Sinica. The database is available to those who sign the license agreement and comply with its terms at the Association for Computational Linguistics and Chinese Language Processing (ACLCLP).

To apply for access, please visit the ACLCLP COSPRO & Toolkit page (http://www.aclclp.org.tw/use_mat.php#cospro).

Linguistic Field(s): Computational Linguistics
                     Discourse Analysis
                     Phonetics

Subject Language(s): Chinese, Mandarin (cmn)
                     English (eng)
