32.1752, FYI: May 2021 Newsletter - LDC

Wed May 19 12:47:37 UTC 2021

LINGUIST List: Vol-32-1752. Wed May 19 2021. ISSN: 1069 - 4875.

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderators: Jeremy Coburn, Lauren Perkins
Managing Editor: Becca Morris
Team: Helen Aristar-Dry, Everett Green, Sarah Robinson, Nils Hjortnaes, Joshua Sims, Billy Dickson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================

Date: Wed, 19 May 2021 08:46:44
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: May 2021 Newsletter - LDC

In this newsletter: 
LDC at ICASSP 2021 

New Publications:
The SSNCE Database of Tamil Dysarthric Speech
ESPADA
BOLT Chinese SMS/Chat Parallel Training Data
________________________________________
LDC at ICASSP 2021
LDC will be exhibiting at ICASSP 2021, held virtually this year June 6-11.
Stop by our digital booth June 8-10 to learn more about recent developments at
the Consortium and new publications.

Also, check out the following poster featuring LDC work:

Probing Acoustic Representations for Phonetic Properties
Wednesday, June 9, 14:00 - 14:45
Session: AUD-11: Auditory Modeling and Hearing Instruments

LDC will post conference links and updates via our Twitter feed and Facebook
page. We hope to “see” you there!
________________________________________

New Publications:
(1) The SSNCE Database of Tamil Dysarthric Speech was developed by the Speech
Lab, SSN College of Engineering, India, in collaboration with the Indian
National Institute of Empowerment of Persons with Multiple Disabilities
(NIEPMD) and contains approximately eight hours of Tamil speech data,
time-aligned transcripts and metadata collected from 30 speakers (20
dysarthric speakers and 10 non-dysarthric speakers).

The speech data was collected between 2015 and 2017 in two sessions at NIEPMD.
Each speaker recorded 365 utterances comprising single words as well as
sentences that combined common and uncommon Tamil phrases.
The non-dysarthric speakers were five female and five male subjects. The
dysarthric speakers (7 female, 13 male) reported a diagnosis of cerebral palsy
and ranged in age from 12 to 37 years.

The SSNCE Database of Tamil Dysarthric Speech is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus
provided they have submitted a completed copy of the special license
agreement. 2021 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.
*
(2) ESPADA (Extended Syntactic Phrase Alignment DAtaset) consists of annotated
parse trees and phrase alignments for English sentential paraphrases from
NIST’s OpenMT evaluation corpora. It extends SPADE (LDC2018T09) by adding new
annotated data to SPADE's development and test sets for training and testing
phrasal paraphrase detection and phrase representation models. Gold standard
HPSG (head-driven phrase structure grammar) trees and phrase alignments were
annotated, resulting in 251,972 phrase alignments across 1,916 sentential
paraphrases.

ESPADA is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus.
2021 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.
*
(3) BOLT Chinese SMS/Chat Parallel Training Data was developed by LDC and
consists of approximately 1.8 million tokens of Chinese SMS/Chat data and
their corresponding English translations.

The source data was donated or collected by LDC via live platforms. Data was
manually selected for translation. Messages/conversations were arranged in
chronological order, segmented into sentence units (all or portions of message
threads depending on their length), and assigned to translation vendors.
Translators followed LDC's BOLT translation guidelines.

BOLT Chinese SMS/Chat Parallel Training Data is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus.
2021 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104

Linguistic Field(s): Computational Linguistics
