27.3360, FYI: August 2016 Newsletter – LDC

The LINGUIST List via LINGUIST linguist at listserv.linguistlist.org
Tue Aug 23 19:00:15 UTC 2016


LINGUIST List: Vol-27-3360. Tue Aug 23 2016. ISSN: 1069 - 4875.

Subject: 27.3360, FYI: August 2016 Newsletter – LDC

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry,
                                   Robert Coté, Michael Czerniakowski)
Homepage: http://linguistlist.org

*****************    LINGUIST List Support    *****************
                       Fund Drive 2016
                   25 years of LINGUIST List!
Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Kenneth Steimel <ken at linguistlist.org>
================================================================


Date: Tue, 23 Aug 2016 14:59:55
From: LDC LDC [ldc at ldc.upenn.edu]
Subject: August 2016 Newsletter – LDC

 
In this Newsletter:

- Fall 2016 Data Scholarship Program
- LDC at Interspeech 2016

New Publications:

- IARPA Babel Bengali Language Pack
- IARPA Babel Assamese Language Pack
- GALE Phase 3 Arabic Broadcast News Speech Part 1
- GALE Phase 3 Arabic Broadcast News Transcripts Part 1

Fall 2016 LDC Data Scholarship program September 15 deadline approaching:

Student applications for the Fall 2016 LDC Data Scholarship program are being
accepted now through Thursday, September 15, 2016, 11:59PM EST.  The LDC Data
Scholarship program provides university students with access to LDC data at no
cost. Students must complete an application which consists of a data use
proposal and letter of support from their advisor. 

For more information on application requirements and program rules, please
visit the LDC Data Scholarship page. 

Applicants can email their materials to the LDC Data Scholarship program. 

LDC at Interspeech 2016:

LDC will once again be exhibiting at Interspeech, held this year September
9-12 in San Francisco, California. Stop by booth 17 to learn more about recent
developments at the Consortium and new publications.

Also, be on the lookout for the following presentations featuring LDC work:

Automatic Analysis of Phonetic Speech Style Dimensions: Neville Ryant and Mark
Liberman (both LDC) 
Friday 9 September, Oral Session, Bayview A, 11:00am 

The Rhythmic Constraint on Prosodic Boundaries in Mandarin Chinese Based on
Corpora of Silent Reading and Speech Perception: Wei Lai (UPenn), Jiahong Yuan
(LDC), Ya Li (Chinese Academy of Science), Xiaoying Xu (Beijing Normal
University) and Mark Liberman (LDC)
Friday 9 September, Oral Session, Bayview A, 11:00am

Pitch-range Perception: the Dynamic Interaction Between Voice Quality and
Fundamental Frequency: Jianjing Kuang (UPenn) and Mark Liberman (LDC)
Saturday 10 September, Poster Session A, 10:00am 

Phoneme, Phone Boundary, and Tone in Automatic Scoring of Mandarin
Proficiency: Jiahong Yuan and Mark Liberman (both LDC)
Sunday 11 September, Poster Session A, 10:00am 

LDC will post conference updates via our Twitter feed and Facebook page. We
hope to see you there!   

New Publications:

(1) IARPA Babel Bengali Language Pack IARPA-babel103b-v0.4b was developed by
Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel
program. It contains approximately 215 hours of Bengali conversational and
scripted telephone speech collected in 2011 and 2012 along with corresponding
transcripts.

The Babel program focuses on underserved languages and seeks to develop speech
recognition technology that can be rapidly applied to any human language to
support keyword search performance over large amounts of recorded speech.

The Bengali speech in this release represents that spoken in India by native
speakers of Bengali born in India. The gender distribution among speakers is
approximately even; speakers' ages range from 16 years to 65 years. Calls were
made using different telephones (e.g., mobile, landline) from a variety of
environments.

All audio data is presented as 8 kHz, 8-bit A-law encoded audio in SPHERE
format. Transcripts are available in two versions: the Bengali script and a
romanization scheme developed by Appen Butler Hill, both encoded in UTF-8. 
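
For readers unfamiliar with the encoding, the following is a minimal sketch of
how 8-bit A-law samples expand to 16-bit linear PCM, following the standard
G.711 expansion. It is not part of the LDC release. The function names are
hypothetical, and the sketch assumes the SPHERE header has already been removed
so that only raw sample bytes are passed in.

```python
def alaw_to_linear(a_val: int) -> int:
    """Expand one 8-bit A-law code to a signed 16-bit linear PCM value."""
    a_val ^= 0x55                 # undo the even-bit inversion applied by the encoder
    t = (a_val & 0x0F) << 4       # 4-bit mantissa, shifted into position
    seg = (a_val & 0x70) >> 4     # 3-bit segment (exponent)
    if seg == 0:
        t += 8                    # lowest segment: half-step offset only
    else:
        t += 0x108                # add the implicit leading bit plus half-step
        if seg > 1:
            t <<= seg - 1         # scale by the segment
    # In A-law, a set sign bit marks a positive sample.
    return t if a_val & 0x80 else -t


def decode_alaw(data: bytes) -> list[int]:
    """Decode raw A-law bytes (assumed headerless) to linear PCM samples."""
    return [alaw_to_linear(b) for b in data]
```

In practice an audio library or a SPHERE conversion utility would normally be
used instead; the sketch only illustrates what the encoding stores.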

2016 Subscription Members will receive two copies of this corpus provided they
have submitted a completed copy of the IARPA User Agreement for Not-for-Profit
Members or the IARPA User Agreement for For-Profit Members. 2016 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee. 

(2) IARPA Babel Assamese Language Pack IARPA-babel102b-v0.5a was developed by
Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel
program. It contains approximately 205 hours of Assamese conversational and
scripted telephone speech collected in 2012 and 2013 along with corresponding
transcripts.

The Babel program focuses on underserved languages and seeks to develop speech
recognition technology that can be rapidly applied to any human language to
support keyword search performance over large amounts of recorded speech.

The speech in this release represents three dialects spoken in Assam, a state
in northeastern India. The gender distribution among speakers is approximately
even; speakers' ages range from 16 years to 66 years. Calls were made using
different telephones (e.g., mobile, landline) from a variety of environments.

All audio data is presented as 8 kHz, 8-bit A-law encoded audio in SPHERE
format. Transcripts are available in two versions: Assamese script and a
romanization scheme developed by Appen Butler Hill, both encoded in UTF-8. 

2016 Subscription Members will receive two copies of this corpus provided they
have submitted a completed copy of the IARPA User Agreement for Not-for-Profit
Members or the IARPA User Agreement for For-Profit Members. 2016 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee. 

(3) GALE Phase 3 Arabic Broadcast News Speech Part 1 was developed by LDC and
comprises approximately 132 hours of Arabic broadcast news speech collected in
2007 by LDC; MediaNet (Tunis, Tunisia); and MTC (Rabat, Morocco) during Phase 3
of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding transcripts are released as GALE Phase 3 Arabic Broadcast News
Transcripts Part 1 (LDC2016T17).

The recordings in this corpus feature news broadcasts focusing principally on
current events from various broadcasters, including Abu Dhabi TV, Al Alam News
Channel, Al Arabiya, Al Iraqiyah, Aljazeera, Al Ordiniyah, Dubai TV, Kuwait TV,
Lebanese Broadcast Corporation, Nile TV, Saudi TV and Syria TV.

This release contains 175 audio files presented in FLAC-compressed Waveform
Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was
audited by a native Arabic speaker. 

2016 Subscription Members will automatically receive two copies of this
corpus. 2016 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.

(4) GALE Phase 3 Arabic Broadcast News Transcripts Part 1 was developed by LDC
and contains transcriptions of approximately 132 hours of Arabic broadcast
news speech collected in 2007 by LDC; MediaNet (Tunis, Tunisia); and MTC
(Rabat, Morocco) during Phase 3 of the DARPA GALE (Global Autonomous Language
Exploitation) program.

Corresponding audio data is released as GALE Phase 3 Arabic Broadcast News
Speech Part 1 (LDC2016S07).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8
encoding, and the transcribed data totals 741,689 tokens. The transcripts were
created with the LDC tool XTrans, which supports manual transcription and
annotation of audio recordings. XTrans is available at
https://www.ldc.upenn.edu/language-resources/tools/xtrans.
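
As a rough illustration of working with tab-delimited transcripts, the sketch
below counts whitespace-separated tokens in one column of UTF-8,
tab-delimited text. It is an assumption-laden example, not LDC's method: the
function name, the transcript column index, the comment prefix, and the sample
rows are all hypothetical, and the token count quoted above may have been
computed differently. Consult the corpus documentation for the actual TDF
column layout.

```python
def count_tokens(tdf_text: str, transcript_col: int, comment_prefix: str = ";;") -> int:
    """Count whitespace-separated tokens in one column of tab-delimited text.

    transcript_col and comment_prefix are illustrative assumptions, not
    values taken from the corpus documentation.
    """
    total = 0
    for line in tdf_text.splitlines():
        # Skip blank lines and lines assumed to be comments.
        if not line.strip() or line.startswith(comment_prefix):
            continue
        fields = line.split("\t")
        # Ignore rows too short to contain the transcript column.
        if len(fields) <= transcript_col:
            continue
        total += len(fields[transcript_col].split())
    return total


# Invented sample rows: file name, start time, end time, transcript text.
sample = "\n".join([
    ";; invented comment line",
    "demo.flac\t0.00\t1.50\talpha beta",
    "demo.flac\t1.50\t3.00\tgamma delta epsilon",
])
token_total = count_tokens(sample, transcript_col=3)
```

For real files, the text would be read with open(path, encoding="utf-8") and
any header row handled according to the documentation shipped with the corpus.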

The files in this corpus were transcribed by LDC staff and/or by transcription
vendors under contract to LDC. Transcribers followed LDC's quick transcription
guidelines (QTR) and quick rich transcription specification (QRTR), both of
which are included in the documentation with this release.

2016 Subscription Members will automatically receive two copies of this
corpus. 2016 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.
 



Linguistic Field(s): Computational Linguistics





 


