28.5338, FYI: December 2017 Newsletter - LDC

Sat Dec 16 17:29:55 UTC 2017

LINGUIST List: Vol-28-5338. Sat Dec 16 2017. ISSN: 1069 - 4875.

Subject: 28.5338, FYI: December 2017 Newsletter - LDC

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté,
                                   Michael Czerniakowski)
Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Kenneth Steimel <ken at linguistlist.org>
================================================================

Date: Sat, 16 Dec 2017 12:29:38
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: December 2017 Newsletter - LDC

In this newsletter: 

Spring 2018 LDC Data Scholarship Program - deadline approaching

Lingo Boingo: a web portal to language games

Renew your LDC membership today

New Publications:

CHiME3

GALE Phase 4 Chinese Broadcast News Speech

GALE Phase 4 Chinese Broadcast News Transcripts

Spring 2018 LDC Data Scholarship Program - deadline approaching

Students can apply for the Spring 2018 Data Scholarship Program now through
January 15, 2018. The LDC Data Scholarship program provides students with
access to LDC data at no cost. For more information on application
requirements and program rules, please visit LDC Data Scholarships.  

Lingo Boingo: a web portal to language games

LDC is pleased to announce a new collaborative project, Lingo Boingo
(https://lingoboingo.org), a web portal that brings together new and existing
language games that are fun to play and that provide useful annotations and
judgments for linguistic research. Gamers and grammar lovers can choose from a
list of challenging games, which will continue to expand through the efforts
of LDC and external collaborators. For more information, contact
jfiumara at ldc.upenn.edu. Start playing today! 

Renew your LDC membership today

Membership Year 2018 (MY2018) is open for joining, and discounts are available
for those who keep their membership current and join early in the year.
Current MY2017 members who renew before March 1, 2018 will receive a 10%
discount on the membership fee. New or returning organizations will receive a
5% discount through March 1, 2018.

In addition to receiving new publications, current year LDC members also enjoy
the benefit of licensing older data at reduced costs from our Catalog of over
700 holdings; current year for-profit members may use most data for commercial
applications. Visit Join LDC for details on membership, user accounts and
payment.

Plans for MY2018 publications are in progress. Among the expected releases
are:

- Multilanguage conversational telephone speech: developed to support language
identification research in related languages (Central Asian, Central European
language groups)
- DIRHA (Distant-speech Interaction for Robust Home Applications):  Wall
Street Journal read speech with noise and reverberation, suitable for various
multi-microphone signal processing and distant speech recognition tasks
- TRAD corpora: Chinese-French and Arabic-French parallel text (newswire, web
data)
- IARPA Babel Language Packs (telephone speech and transcripts): languages
include Cebuano, Guarani, Kazakh, Lithuanian, Telugu, Tok Pisin
- BOLT: discussion forum, SMS, word-aligned, and tagged data in all languages
(Egyptian Arabic, English, Chinese)
- DEFT: Spanish Treebank (newswire, web data)
- RATS:  Language Identification data set (Dari, Farsi, Levantine Arabic,
Pashto, Urdu; degraded audio signals)
- TAC KBP: comprehensive English source and entity linked data (broadcast,
telephone speech, newswire, web data)
- German children’s handwriting: longitudinal study of weekly writing in
classroom setting with enhanced output for specific spelling patterns

New publications:

(1) CHiME3 was developed as part of The 3rd CHiME Speech Separation and
Recognition Challenge and contains approximately 342 hours of English speech
and transcripts from noisy environments and 50 hours of noisy environment
audio. The CHiME Challenges focus on distant-microphone automatic speech
recognition (ASR) in real-world environments. CHiME3 involved two types of
data: speech data recorded in very noisy environments (on a bus, in a cafe,
pedestrian area, and street junction) and noisy utterances generated by
artificially mixing clean speech data with noisy backgrounds.

Data is divided into training, development, and test sets. All data is
provided as 16-bit WAV files sampled at 16 kHz. The audio data consists of
background noises, speech data enhanced with the baseline speech enhancement
technique, unsegmented noisy speech data, and segmented noisy speech data.
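
As an illustration of the format stated above, the following is a minimal
Python sketch for checking that a WAV file is 16-bit audio sampled at 16 kHz.
It is not an LDC tool. The function names describe_wav and
matches_stated_format are hypothetical, and only the Python standard library
is assumed.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, bits_per_sample, channels, duration_seconds)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, bits, channels, duration


def matches_stated_format(path, rate_hz=16000, bits=16):
    """Check a file against the 16 kHz, 16-bit format given in the newsletter."""
    rate, width, _, _ = describe_wav(path)
    return rate == rate_hz and width == bits
```

A check like this could be run over a downloaded file before further
processing. The default values mirror the newsletter text and can be changed
for other corpora.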

LDC has also released two CHiME2 corpora -- CHiME2 Grid and CHiME2 WSJ0.

CHiME3 is distributed via USB drive.

2017 Subscription Members will receive copies of this corpus. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

(2) GALE Phase 4 Chinese Broadcast News Speech was developed by LDC and
comprises approximately 134 hours of Mandarin Chinese broadcast news speech
collected in 2008 by LDC and Hong Kong University of Science and Technology
(HKUST), Hong Kong, during Phase 4 of the DARPA GALE (Global Autonomous
Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 4 Chinese Broadcast News
Transcripts (LDC2017T18).

The broadcast news recordings in this release feature news broadcasts focusing
principally on current events from the following sources: China Central TV
(CCTV), a national and international broadcaster in Mainland China; Phoenix
TV, a Hong Kong-based satellite television station; and Voice of America
(VOA), a U.S. government-funded broadcast programmer.

This release contains 256 audio files presented in FLAC-compressed Waveform
Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was
audited by a native Chinese speaker following Audit Procedure Specification
Version 2.0, which is included in this release.

GALE Phase 4 Chinese Broadcast News Speech is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

(3) GALE Phase 4 Chinese Broadcast News Transcripts was developed by LDC and
contains transcriptions of approximately 134 hours of Chinese broadcast news
speech collected in 2008 by LDC and Hong Kong University of Science and
Technology (HKUST), Hong Kong, during Phase 4 of the DARPA GALE (Global
Autonomous Language Exploitation) Program.

Corresponding audio data is released as GALE Phase 4 Chinese Broadcast News
Speech (LDC2017S25).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8
encoding, and the transcribed data totals 1,696,879 tokens. The transcripts
were created with the LDC tool, XTrans, which supports manual transcription
and annotation of audio recordings. 
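
As an illustration, the following is a minimal Python sketch for reading a
UTF-8, tab-delimited text file such as the transcript format described above.
It is not part of XTrans or any LDC tool. The function name
read_tab_delimited is hypothetical, and the sketch makes no assumption about
column names or header lines, since the newsletter does not list them.

```python
import csv


def read_tab_delimited(path):
    """Return the non-empty rows of a UTF-8, tab-delimited file as lists of strings."""
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks in the transcript text as ordinary
        # characters instead of treating them as field delimiters.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row for row in reader if row]
```

Each returned row is a list of field strings in file order. Which field holds
the transcript text should be confirmed against the documentation included
with the corpus.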

GALE Phase 4 Chinese Broadcast News Transcripts is distributed via web
download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

Linguistic Field(s): Computational Linguistics
