30.2475, FYI: June 2019 Newsletter - LDC

Wed Jun 19 01:26:15 UTC 2019

LINGUIST List: Vol-30-2475. Tue Jun 18 2019. ISSN: 1069 - 4875.

Subject: 30.2475, FYI: June 2019 Newsletter - LDC

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Jeremy Coburn
Managing Editor: Becca Morris
Team: Helen Aristar-Dry, Everett Green, Sarah Robinson, Peace Han, Nils Hjortnaes, Yiwen Zhang, Julian Dietrich
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================

Date: Tue, 18 Jun 2019 21:16:09
From: Membership Office [ldc at ldc.upenn.edu]
Subject: June 2019 Newsletter - LDC

In this newsletter: 
New Publications:

DEFT Spanish Committed Belief Annotation
USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition
First DIHARD Challenge Development - Eight Sources
First DIHARD Challenge Development - SEEDLingS

New publications:

(1) DEFT Spanish Committed Belief Annotation was developed by LDC and consists
of approximately 67,000 tokens of Spanish discussion forum text annotated for
"committed belief," which marks the level of commitment displayed by the
author to the truth of the propositions expressed in the text.

DARPA's Deep Exploration and Filtering of Text (DEFT) program aimed to address
remaining capability gaps in state-of-the-art natural language processing
technologies related to inference, causal relationships, and anomaly
detection. LDC supported the DEFT program by collecting, creating, and
annotating a variety of data sources.

DEFT Spanish Committed Belief Annotation is distributed via web download. 

2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee. 

*

(2) USC-SFI MALACH Interviews and Transcripts English – Speech Recognition
Edition was developed by IBM as part of the MALACH (Multilingual Access to
Large Spoken ArCHives) Project and contains approximately 168 hours of
interviews from 682 Holocaust witnesses along with transcripts, a lexicon and
other documentation. This release augments USC-SFI MALACH Interviews and
Transcripts English (LDC2012S05) by modifying and updating a subset of the
original corpus for use with speech recognition systems, such as the Kaldi
toolkit. 

Specifically, the audio data has been converted from unsegmented MPEG files to
a segmented, FLAC-compressed format. The speaker-turn, time-stamped transcripts
have been updated to an utterance-by-utterance format. A lexicon mapping words
to phonemes is provided, and the data is divided into development and training
sets.
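
As a rough illustration of what a word-to-phoneme lexicon looks like in use,
the sketch below reads entries in the one-line-per-pronunciation layout that
Kaldi recipes commonly use (a word followed by its phonemes, separated by
whitespace). The newsletter does not state the actual file format of the
lexicon in this release, so the layout is an assumption, and the sample
entries and the parse_lexicon helper are hypothetical, not taken from the
corpus.

```python
from collections import defaultdict


def parse_lexicon(lines):
    """Map each word to a list of pronunciations (each a list of phonemes).

    Assumes one pronunciation per line: WORD PHONE1 PHONE2 ...
    A word may appear on several lines if it has several pronunciations.
    """
    lexicon = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, phonemes = parts[0], parts[1:]
        lexicon[word].append(phonemes)
    return dict(lexicon)


# Hypothetical sample entries, for illustration only.
sample = [
    "archive AA R K AY V",
    "spoken S P OW K AH N",
    "the DH AH",
    "the DH IY",
]

lexicon = parse_lexicon(sample)
print(lexicon["the"])
```

Keeping a list of pronunciations per word, rather than a single one, allows
for words with more than one pronunciation.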

The goal of the MALACH project was to develop methods for improved access to
large multinational spoken archives in order to advance the state of the art
of automatic speech recognition and information retrieval. The characteristics
of the USC-SFI collection -- unconstrained, natural speech filled with
disfluencies, heavy accents, age-related coarticulations, un-cued speaker and
language switching, and emotional speech -- were considered well-suited for
that task. 

USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition
is distributed via web download. 

2019 Subscription Members will automatically receive copies of this corpus
provided they have submitted a completed copy of the special license
agreement. 2019 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data at no cost.

*

(3) First DIHARD Challenge Development - Eight Sources was developed by LDC
and contains approximately 17 hours of English and Chinese speech data along
with corresponding annotations used in support of the First DIHARD Challenge.
This release, when combined with First DIHARD Challenge Development -
SEEDLingS (LDC2019S10), contains the development set audio data and annotation
(diarization, segmentation) as well as the official scoring tool.

The First DIHARD Challenge was an attempt to reinvigorate work on diarization
through a shared task focusing on "hard" diarization; that is, speech
diarization for challenging corpora where there was an expectation that
existing state-of-the-art systems would fare poorly. As such, it included
speech from a wide sampling of domains representing diversity in number of
speakers, speaker demographics, interaction style, recording quality, and
environmental conditions as follows (all sources are in English unless
otherwise indicated):

- Autism Diagnostic Observation Schedule (ADOS) interviews
- DCIEM/HCRC map task (LDC96S38)
- Audiobook recordings from LibriVox
- Meeting speech from 2004 Spring NIST Rich Transcription (RT-04S) Development
(LDC2007S11) and Evaluation (LDC2007S12) releases
- 2001 U.S. Supreme Court oral arguments
- Sociolinguistic interviews from SLX Corpus of Classic Sociolinguistic
Interviews (LDC2003T15)
- Chinese video collected by LDC as part of the Video Annotation for Speech
Technologies (VAST) project
- YouthPoint radio interviews

First DIHARD Challenge Development - Eight Sources is distributed via web
download. 

2019 Subscription Members will automatically receive copies of this corpus. 
2019 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

*

(4) First DIHARD Challenge Development - SEEDLingS was developed by Duke
University and LDC and contains approximately two hours of English child
language recordings along with corresponding annotations used in support of
the First DIHARD Challenge. This release, when combined with First DIHARD
Challenge Development - Eight Sources (LDC2019S09), contains the development
set audio data and annotation (diarization, segmentation) as well as the
official scoring tool.

The source data was drawn from the SEEDLingS (The Study of Environmental
Effects on Developing Linguistic Skills) corpus, designed to investigate how
infants' early linguistic and environmental input plays a role in their
learning. Recordings for SEEDLingS were generated in the home environment of
44 infants from 6 to 18 months of age in the Rochester, New York, area. A subset
of that data was annotated by LDC for use in the First DIHARD Challenge.

The First DIHARD Challenge was an attempt to reinvigorate work on diarization
through a shared task focusing on "hard" diarization; that is, speech
diarization for challenging corpora where there was an expectation that
existing state-of-the-art systems would fare poorly. As such, it included
speech from a wide sampling of domains representing diversity in number of
speakers, speaker demographics, interaction style, recording quality, and
environmental conditions.

First DIHARD Challenge Development - SEEDLingS is distributed via web
download. 

2019 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2019
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

*

Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104

Linguistic Field(s): Computational Linguistics
