29.3207, FYI: August 2018 Newsletter - LDC

Sun Aug 19 00:21:26 UTC 2018

LINGUIST List: Vol-29-3207. Sat Aug 18 2018. ISSN: 1069 - 4875.

Subject: 29.3207, FYI: August 2018 Newsletter - LDC

Moderator: linguist at linguistlist.org (Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté)
Homepage: https://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================

Date: Sat, 18 Aug 2018 20:19:24
From: Membership Office [ldc at ldc.upenn.edu]
Subject: August 2018 Newsletter - LDC

In this newsletter:
LDC at Interspeech 2018
Fall 2018 LDC Data Scholarship Program

New publications:
BOLT English SMS/Chat
CIEMPIESS Balance
2011 NIST Language Recognition Evaluation Test Set

LDC at Interspeech 2018
LDC will participate in several ways in Interspeech 2018, held this year in
Hyderabad, India, September 2-6. It is co-organizing the special session “The
First DIHARD Speech Diarization Challenge” on September 3 and is a sponsor of
the September 1 pre-conference workshop, Young Female Researchers in Speech
Science & Technology (YFRSW). LDC will also present recent work, “Global
TIMIT: Acoustic Phonetic Datasets for the World’s Languages,” during the
poster session on September 3.

Fall 2018 LDC Data Scholarship Program
Students can apply for the Fall 2018 Data Scholarship Program now through
September 15, 2018. The LDC Data Scholarship program provides students with
access to LDC data at no cost. For more information on application
requirements and program rules, please visit LDC Data Scholarships: 
https://www.ldc.upenn.edu/language-resources/data/data-scholarships

New publications:

(1) BOLT English SMS/Chat was developed by LDC and consists of
naturally-occurring Short Message Service (SMS) and Chat (CHT) data collected
through data donations and live collection from native English speakers. The
corpus contains 18,429 conversations totaling 3,674,802 words across 375,967
messages.

The BOLT (Broad Operational Language Translation) program developed machine
translation and information retrieval for less formal genres, focusing
particularly on user-generated content. LDC supported the BOLT program by
collecting informal data sources -- discussion forums, text messaging, and
chat -- in Chinese, Egyptian Arabic and English. The collected data was
translated and annotated for various tasks including word alignment,
treebanking, propbanking and co-reference.

BOLT English SMS/Chat is available via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

*

(2) CIEMPIESS Balance (Corpus de Investigación en Español de México del
Posgrado de Ingeniería Eléctrica y Servicio Social) was developed by the
Development of Speech Technologies program at the School of Engineering at the
National Autonomous University of Mexico (UNAM) and consists of approximately
18 hours of Mexican Spanish broadcast speech with associated transcripts. The
goal of this work was to create acoustic models for automatic speech
recognition. For more information and documentation see the CIEMPIESS-UNAM
Project website.

CIEMPIESS Balance is a companion corpus to CIEMPIESS Light, released by LDC as
LDC2017S23. It was developed so that the data sets together constitute a
gender-balanced corpus. The gender breakdown in CIEMPIESS Light is
approximately 75% male and 25% female. In CIEMPIESS Balance, the gender
breakdown is approximately 25% male and 75% female.

The majority of the speech recordings were collected from Radio-IUS, a UNAM
radio station. Other recordings were taken from IUS Canal Multimedia and
Centro Universitario de Estudios Jurídicos (CUEJ UNAM), two channels that
feature videos on legal issues and topics related to UNAM.

CIEMPIESS Balance is available via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data at no cost.

*

(3) 2011 NIST Language Recognition Evaluation Test Set contains selected
training data and the evaluation test set for the 2011 NIST Language
Recognition Evaluation. It consists of approximately 204 hours of
conversational telephone speech and broadcast audio collected by LDC between
2009 and 2011 in the following 24 languages and dialects: Arabic (Iraqi),
Arabic (Levantine), Arabic (Maghrebi), Arabic (Standard), Bengali, Czech,
Dari, English (American), English (Indian), Farsi, Hindi, Lao, Mandarin,
Punjabi, Pashto, Polish, Russian, Slovak, Spanish, Tamil, Thai, Turkish,
Ukrainian, and Urdu.

The 2011 evaluation emphasized the language pair condition and involved both
conversational telephone speech (CTS) and broadcast narrow-band speech (BNBS).

This release includes training data for nine language varieties that had not
been represented in prior LRE cycles -- Arabic (Iraqi), Arabic (Levantine),
Arabic (Maghrebi), Arabic (Standard), Czech, Lao, Punjabi, Polish and Slovak
-- contained in 893 audited segments of roughly 30 seconds each and in 400
full-length CTS recordings. The evaluation test set comprises a total of
29,511 audio files, all manually audited at LDC for language and divided
equally into three different test conditions according to the nominal amount
of speech content per segment.

LDC released the prior LREs as:

- 2003 NIST Language Recognition Evaluation (LDC2006S31)
- 2005 NIST Language Recognition Evaluation (LDC2008S05)
- 2007 NIST Language Recognition Evaluation Test Set (LDC2009S04)
- 2007 NIST Language Recognition Evaluation Supplemental Training Set (LDC2009S05)
- 2009 NIST Language Recognition Evaluation Test Set (LDC2014S06)

2011 NIST Language Recognition Evaluation Test Set is distributed via web
download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

Linguistic Field(s): Computational Linguistics

------------------------------------------------------------------------------
