36.2444, FYI: August 2025 Newsletter - LDC

Mon Aug 18 16:05:02 UTC 2025

LINGUIST List: Vol-36-2444. Mon Aug 18 2025. ISSN: 1069 - 4875.

Subject: 36.2444, FYI: August 2025 Newsletter - LDC

Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Valeriia Vyshnevetska
Team: Helen Aristar-Dry, Mara Baccaro, Daniel Swanson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Editor for this issue: Daniel Swanson <daniel at linguistlist.org>

================================================================

Date: 15-Aug-2025
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: August 2025 Newsletter - LDC

In this newsletter:
LDC at Interspeech 2025
Fall 2025 LDC data scholarship program
New publications:
Mixer 6 – CHiME 8 Transcribed Calls and Interviews
Abstract Meaning Representation 2.0 – Machine Translations
KAIROS Phase 1 Quizlet
________________________________________
LDC at Interspeech 2025
LDC will be exhibiting at Interspeech 2025, held this year August
17-21 in Rotterdam, the Netherlands. Stop by our booth to say hello
and learn about the latest developments at the Consortium. Also be on
the lookout for the following presentations, posters, and special
sessions featuring LDC work:
Comparative Evaluation of Acoustic Feature Extraction Tools for
Clinical Speech Analysis
  Monday, August 18, 11:00-13:00 - Area5-Oral1 - Speech Analysis,
  Detection and Classification 1
Reasoning-Based Approach with Chain-of-Thought for Alzheimer’s
Detection Using Speech and Large Language Models
  Tuesday, August 19, 13:30-15:30 - Area1-Poster2B - Databases and
  Progress in Methodology
Special Session: Challenges in Speech Collection, Curation and
Annotation
  Wednesday, August 20, 13:30-15:30 - Area14-SS7 – Part 1
  Wednesday, August 20, 16:00-18:00 - Area14-SS8 – Part 2
TELVID: A Multilingual Multi-modal Corpus for Speaker Recognition
  Thursday, August 21, 13:30-15:30 - AREA4-Oral8 – Speaker Recognition
LDC also supported the Interspeech 2025 URGENT Challenge, which aims to
bring more attention to constructing Universal, Robust, and
Generalizable speech EnhancemeNT models.
LDC will post conference updates via our social media platforms. We
look forward to seeing you in Rotterdam!
Fall 2025 LDC data scholarship program
Student applications for the Fall 2025 LDC data scholarship program
are being accepted now through September 15, 2025. This program
provides eligible students with no-cost access to LDC data. Students
must complete an application consisting of a data use proposal and
letter of support from their advisor. For application requirements and
program rules, visit the LDC Data Scholarships page.
________________________________________
New publications:
Mixer 6 - CHiME 8 Transcribed Calls and Interviews was developed for
the 7th and 8th CHiME (Computational Hearing in Multisource
Environments) challenges. It contains 80 hours of English interviews
and telephone speech from Mixer 6 Speech (LDC2013S03) with transcripts
developed for the CHiME challenges divided into training, development,
and test sets. This data was used in CHiME 7 Task 1 and CHiME 8 Task
1, both of which focused on transcription and segmentation across
varied recording conditions such as interviews, meetings, and dinner
parties, with an emphasis on generalization across recording device
types and array topologies.
The data includes audio from Mixer 6 Speech recorded on 13 microphones
for a total of 1,063 hours (corresponding to 80 hours of speech). The
development and test sets are speaker-disjoint from the training data
and consist of fully transcribed, multi-microphone interviews. Each
transcript segment was labeled with the speaker, the uttered text, and
the start and end times in seconds for that segment.
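As an illustration of how segment labels like these might be handled, here is a minimal Python sketch. The Segment class, its field names, and the sample values are hypothetical and are not taken from the corpus documentation; the corpus itself may use a different file format and different field names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One transcript segment (hypothetical field names)."""
    speaker: str   # speaker label
    text: str      # uttered text
    start: float   # segment start time, in seconds
    end: float     # segment end time, in seconds

    @property
    def duration(self) -> float:
        # Length of the segment in seconds.
        return self.end - self.start


def total_speech_per_speaker(segments):
    """Sum segment durations (seconds) for each speaker label."""
    totals = {}
    for seg in segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + seg.duration
    return totals


# Made-up sample data, for illustration only.
segments = [
    Segment("spk_A", "hello there", 0.0, 1.5),
    Segment("spk_B", "hi", 1.5, 2.0),
    Segment("spk_A", "how are you", 2.0, 3.5),
]
totals = total_speech_per_speaker(segments)
```

Keeping start and end times as plain seconds makes per-speaker totals and overlap checks simple arithmetic.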
2025 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
*
Abstract Meaning Representation 2.0 - Machine Translations was
developed at the University of Edinburgh, School of Informatics and
the University of Zurich, Department of Computational Linguistics. It
consists of Spanish, German, Italian, and Mandarin Chinese automatic
translations of the source English and professionally translated
Spanish, German, Italian, and Mandarin Chinese sentences in Abstract
Meaning Representation 2.0 - Four Translations (LDC2020T07). The
translations were collected through Google Translate between May 2018
and March 2024.
The source English sentences are a subset (1,371 sentences) of the
sentences contained in Abstract Meaning Representation (AMR)
Annotation Release 2.0 (LDC2017T10), a semantic treebank of over
39,000 English natural language sentences from broadcast
conversations, newswire, and web text.
Translations were from each of the five languages (English, Spanish,
German, Italian, and Mandarin Chinese) to the other four languages
(Spanish, German, Italian, and Mandarin Chinese) covering 20 language
pairs. The dataset contains 1,371 source sentences in each language,
each with a professionally translated source sentence and multiple
dated translations by Google Translate.
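The count of 20 language pairs matches one reading of the description: every ordered (source, target) pair of distinct languages among the five. The short Python sketch below enumerates the pairs under that assumption; the variable names are illustrative and the corpus documentation may define the pairs differently.

```python
from itertools import permutations

# The five languages named in the corpus description.
LANGUAGES = ["English", "Spanish", "German", "Italian", "Mandarin Chinese"]

# Assumption: a "language pair" is an ordered (source, target) pair of
# two distinct languages, so each of the 5 sources has 4 possible targets.
pairs = list(permutations(LANGUAGES, 2))

print(len(pairs))
for source, target in pairs[:4]:
    print(f"{source} -> {target}")
```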
2025 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
*
KAIROS Phase 1 Quizlet was developed by LDC and contains English and
Spanish text, video, and image data and annotations used for
pre-evaluation research and system development during Phase 1 of the
DARPA KAIROS program. KAIROS Quizlets were a series of narrowly
defined tasks designed to explore specific evaluation objectives,
enabling KAIROS system developers to exercise individual system
components on a small data set prior to the full program evaluation.
This corpus contains the complete set of Quizlet data used in Phase 1,
which focused on two real-world complex events (CEs) within the
Improvised Explosive Device bombing scenario: CE1001 (2018 Caracas
drone attack) and CE1002 (Utah High School backpack bombing).
Source data was collected from the web; 30 root web pages were
collected and processed, yielding 29 text data files, 216 image files
and 5 video files. Annotation steps included labeling
scenario-relevant events and relations for each document to develop a
structured representation of temporally ordered events, relations, and
arguments, and generating a reference knowledge graph.
The DARPA KAIROS (Knowledge-directed Artificial Intelligence Reasoning
Over Schemas) program aimed to build technology capable of
understanding and reasoning about complex real-world events in order
to provide actionable insights to end users. KAIROS systems utilized
formal event representations in the form of schema libraries that
specified the steps, preconditions, and constraints for an open set of
complex events; schemas were then used in combination with event
extraction to characterize and make predictions about real-world
events in a large multilingual, multimedia corpus.
2025 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.

Linguistic Field(s): Computational Linguistics

------------------------------------------------------------------------------

********************** LINGUIST List Support ***********************
Please consider donating to the Linguist List, a U.S. 501(c)(3) not for profit organization:

https://www.paypal.com/donate/?hosted_button_id=87C2AXTVC4PP8

LINGUIST List is supported by the following publishers:

Bloomsbury Publishing http://www.bloomsbury.com/uk/

Cascadilla Press http://www.cascadilla.com/

Edinburgh University Press http://www.edinburghuniversitypress.com

John Benjamins http://www.benjamins.com/

Language Science Press http://langsci-press.org

MIT Press http://mitpress.mit.edu/

Multilingual Matters http://www.multilingual-matters.com/

Narr Francke Attempto Verlag GmbH + Co. KG http://www.narr.de/

Netherlands Graduate School of Linguistics / Landelijke (LOT) http://www.lotpublications.nl/

Peter Lang AG http://www.peterlang.com
