32.3280, FYI: October 2021 Newsletter - LDC

Tue Oct 19 08:04:19 UTC 2021

LINGUIST List: Vol-32-3280. Tue Oct 19 2021. ISSN: 1069 - 4875.

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Jeremy Coburn, Lauren Perkins
Managing Editor: Becca Morris
Team: Helen Aristar-Dry, Everett Green, Sarah Robinson, Nils Hjortnaes, Joshua Sims, Billy Dickson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================

Date: Tue, 19 Oct 2021 04:03:59
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: October 2021 Newsletter - LDC

In this newsletter: 
Fall 2021 data scholarship recipients
Membership Year 2022 publication preview
LDC data and commercial technology development 

New Publications:
UCLA Variability Speaker Database
BOLT Egyptian Arabic Treebank – SMS/Chat
---
Fall 2021 data scholarship recipients
Congratulations to the recipients of LDC's Fall 2021 data scholarships:
Sophia Minnillo: University of California, Davis (USA); PhD, Linguistics.
Jagabandhu Mishra: Indian Institute of Technology Dharwad (India); Research
Scholar, Electrical Engineering.
Kashyap Patel: University of Texas at Dallas (USA); PhD, Electrical
Engineering.
Yoshani Ranaweera, D. Dissanayaka, S. Sudasinghe: University of Moratuwa (Sri
Lanka); Bachelor's, Computer Science and Engineering.
Winie Wong: University of Illinois at Chicago (USA); PhD, Electrical and
Computer Engineering.

For information about the program, visit the Data Scholarships page.

Membership Year 2022 publication preview
The 2022 Membership Year is approaching and plans for next year’s publications
are in progress. Among the expected releases are:
- 2017 NIST OpenSAT Pilot – SSSF: real-world operational English speech data,
transcripts, and annotation files used in the speech activity detection,
automatic speech recognition, and keyword search tasks of the 2017 OpenSAT
Pilot evaluation
- AttImam: 2000 attribution relations applied to Arabic newswire text from
Arabic Treebank: Part 1 v 4.1 LDC2010T13
- Samrómur Icelandic Speech: 145 hours of Icelandic prompted speech from 8000
speakers covering text from novels, news, plays, and location names
- MASRI Synthetic: 99 hours of synthesized Maltese speech from various genres
with transcripts
- HAVIC MED Novel Tests: thousands of hours of event and background
user-generated videos with annotation and metadata used for the 2015
Multimedia Event Detection task
- DIHARD Challenges: development and evaluation data from the second and third
DIHARD evaluations, a set of shared tasks focusing on speech diarization for
challenging data
- LORELEI: representative and incident language packs containing monolingual
text, bi-text, translations, annotations, supplemental resources, and related
tools (Kinyarwanda, Wolof)

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a
prerequisite for obtaining a commercial license to almost all LDC databases.
Non-member organizations, including non-member for-profit organizations,
cannot use LDC data to develop or test products for commercialization, nor can
they use LDC data in any commercial product or for any commercial purpose.
---
New publications:
(1) UCLA Variability Speaker Database was developed by the UCLA Speech
Processing and Auditory Perception Laboratory and comprises approximately 34
hours of English speech and orthographic transcripts. Speakers (101 female, 101
male) took part in the following tasks: vowel sounds, reading sentences, giving
instructions, neutral conversation, happy conversation, phone conversation,
annoyed conversation, and responding to a video. This corpus was designed to
sample variability in speaking within individual speakers and across a large
number of speakers.

(2) BOLT Egyptian Arabic Treebank – SMS/Chat was developed by LDC and consists
of Egyptian Arabic SMS/Chat data with part-of-speech annotation, morphology,
and syntactic tree annotation. This release contains 349,414 tokens before
clitics were split and 435,677 tree tokens after clitics were split for
treebank annotation. The source data was collected by LDC from its collection
platform or by donation and was manually reviewed to exclude material not in
the target language or with sensitive content. Originally written in Arabizi
(Romanized/Latin characters) script, the source SMS/chat text was
transliterated to Arabic script and manually corrected prior to treebank
annotation. Annotations followed Penn Arabic Treebank guidelines.

Linguistic Field(s): Computational Linguistics
