33.2031, FYI: June 2022 Newsletter - LDC

Thu Jun 16 02:27:02 UTC 2022

LINGUIST List: Vol-33-2031. Thu Jun 16 2022. ISSN: 1069 - 4875.

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Billy Dickson
Managing Editor: Lauren Perkins
Team: Helen Aristar-Dry, Everett Green, Sarah Goldfinch, Nils Hjortnaes,
        Joshua Sims, Billy Dickson, Amalia Robinson, Matthew Fort
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Hosted by Indiana University

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================

Date: Thu, 16 Jun 2022 02:26:44
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: June 2022 Newsletter - LDC

In this newsletter: 
LDC at LREC 2022
LDC data and commercial technology development
30th Anniversary Highlight: TIMIT 

New publication:
Second DIHARD Challenge Evaluation - Eleven Sources
________________________________________
LDC at LREC 2022
LDC will attend the 13th Language Resources and Evaluation Conference (LREC
2022), hosted by ELRA, the European Language Resources Association, in
Marseille, France, June 20-25, 2022. Several LDC staff members will present
current work, including: WeCanTalk: A New Multi-language, Multi-modal Resource
for Speaker Recognition; Reflections on 30 Years of Language Resource
Development and Sharing; A Study in Contradiction: Data and Annotation for
AIDA Focusing on Informational Conflict in Russia-Ukraine Relations; Data
Protection, Privacy and US Regulation; BeSt: The Belief and Sentiment Corpus;
and more.

Stay tuned for specific announcements on LDC’s social media pages regarding
presentation times and locations. Following the conference, LDC’s presented
papers and posters will be available on the Papers Page.

30th Anniversary Highlight: TIMIT 
The TIMIT Acoustic-Phonetic Continuous Speech Corpus is another of the classic
releases in LDC’s Catalog. Designed for the acquisition of acoustic-phonetic
knowledge and for the development and evaluation of automatic speech
recognition systems, it contains recordings of 630 American English speakers
each reading 10 phonetically rich sentences, for a total of 6300 utterances
comprising 2342 distinct sentences. Data collection and annotation were a
joint effort by Texas Instruments, the Massachusetts Institute of Technology,
and SRI International, and the data release was prepared by NIST (National
Institute of Standards and Technology).  
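
The figures above (6300 utterances, 2342 distinct sentences) can be checked
with a short sketch. The per-type breakdown used here (SA, SX and SI sentence
sets and how many speakers read each sentence) is not stated in this
newsletter; it is taken from the standard TIMIT corpus documentation and
should be read as an assumption, and the names in the code are illustrative
only.

```python
# Sentence-type breakdown as described in the TIMIT corpus documentation
# (an assumption here; the newsletter itself gives only the totals).
# name -> (distinct sentences, number of speakers reading each sentence)
SENTENCE_TYPES = {
    "SA (dialect)": (2, 630),
    "SX (phonetically compact)": (450, 7),
    "SI (phonetically diverse)": (1890, 1),
}

SPEAKERS = 630
SENTENCES_PER_SPEAKER = 10


def corpus_totals(sentence_types):
    """Return (distinct sentences, total utterances) for the breakdown."""
    distinct = sum(count for count, _ in sentence_types.values())
    utterances = sum(count * readers for count, readers in sentence_types.values())
    return distinct, utterances


distinct, utterances = corpus_totals(SENTENCE_TYPES)
print(distinct, utterances)  # 2342 6300

# The total must also equal speakers x sentences read per speaker.
assert utterances == SPEAKERS * SENTENCES_PER_SPEAKER
```

Under this breakdown, most of the repetition comes from the two dialect
sentences read by every speaker, which is why 6300 recordings cover only 2342
distinct sentences.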

TIMIT was among the first publications that appeared with the launch of LDC’s
catalog in 1993. It remains one of the Consortium’s ten most distributed
corpora and may be the single most widely used speech database. Despite its
age and small size relative to modern data sets, TIMIT’s wide range of
phonetically representative inputs, its time-aligned lexical and phonemic
transcripts, and its easy availability through the LDC Catalog have
contributed to its widespread use and continued popularity. Thousands of
researchers remember its famous first sentence: “she had your dark suit in
greasy wash water all year”.

LDC continues the TIMIT series with its Global TIMIT project, which aims to
create a series of corpora in a variety of languages with TIMIT-like features
(Chanchaochai et al., 2018). Data sets published from that project include
Global TIMIT Learner Treebank English, Global TIMIT Learner Simple English,
Global TIMIT Mandarin Chinese – Guanzhong Dialect, and Global TIMIT Mandarin
Chinese.

The LDC Catalog features over 900 holdings in more than 90 languages, and more
data is added each year. All TIMIT corpora are available for licensing by
Consortium members and non-members. Visit Obtaining Data for more information.
________________________________________
New publication:
Second DIHARD Challenge Evaluation - Eleven Sources was developed by LDC and
contains approximately 20 hours of English and Chinese speech data along with
corresponding annotations used in support of the Second DIHARD Challenge.

Second DIHARD Challenge Evaluation - Eleven Sources is distributed via web
download.  

2022 Subscription Members will automatically receive copies of this corpus.
2022 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104 

Linguistic Field(s): Computational Linguistics
