36.3111, FYI: October 2025 Newsletter - LDC
The LINGUIST List
linguist at listserv.linguistlist.org
Thu Oct 16 17:05:02 UTC 2025
LINGUIST List: Vol-36-3111. Thu Oct 16 2025. ISSN: 1069 - 4875.
Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Valeriia Vyshnevetska
Team: Helen Aristar-Dry, Mara Baccaro, Daniel Swanson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org
Homepage: http://linguistlist.org
Editor for this issue: Daniel Swanson <daniel at linguistlist.org>
================================================================
Date: 15-Oct-2025
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: October 2025 Newsletter - LDC
In this newsletter:
• Membership year 2026 publication preview
• Fall 2025 data scholarship recipients
• New publications:
    • KAIROS Phase 2 Quizlet
    • BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Audio
    • BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Transcripts and
      Translations
_____________________________________
Membership year 2026 publication preview
The 2026 membership year is approaching and plans for next year’s
publications are in progress. Among the expected releases are:
• 2012 NIST Speaker Recognition Evaluation Test Set: 10,000+ hours of
English conversational telephone speech following the Mixer collection
protocol, used in NIST’s 2012 speaker recognition evaluation
• KAIROS schema learning corpus background data and Phase 1 evaluation
datasets: multimodal English and Spanish source data and annotations
for reasoning about complex real-world events
• CALL MY NET 2: 800+ hours of Tunisian Arabic conversational
telephone speech from over 400 speakers to support text-independent
speaker recognition, used in the 2018 NIST Speaker Recognition
Evaluation
• Multi-language conversational telephone speech: multiple releases,
hundreds of hours of speech from speakers of confusable linguistic
varieties (Arabic, Chinese, English, French, Slavic, Spanish) to
support language identification
• CALLHOME Omnibus releases: combined speech and transcript datasets
with updated directory structure, file formats and documentation, and
lexicons (Chinese, English, German, Japanese, Spanish)
• IARPA MATERIAL language packs: conversational telephone speech,
transcripts, English translations, annotations, and queries in
multiple languages (e.g., Lithuanian, Pashto, Swahili, Tagalog)
Check your inbox for more information about membership renewal.
_____________________________________
Fall 2025 data scholarship recipients
Congratulations to the recipients of LDC's Fall 2025 data
scholarships:
Lasidu Dilshan: University of Moratuwa (Sri Lanka): BSc, Electronic
and Telecommunication Engineering. Lasidu is awarded a copy of Asian
Elephant Vocalizations LDC2010S05 for his work in elephant voice
enhancement and classification.
Máté Gedeon: Budapest University of Technology and Economics
(Hungary): PhD candidate, Department of Telecommunications and
Artificial Intelligence. Máté is awarded a copy of Switchboard-1
Release 2 LDC97S62 for his work in simulated conversation generation.
Ping He: Northeastern University (USA): Student, Khoury College of
Computer Sciences. Ping is awarded a copy of ETS Corpus of Non-Native
Written English LDC2014T06 for their work in native language
identification.
Thiyazen Iskander: Maulana Azad College of Arts, Science & Commerce
(India), affiliated with Babasaheb Ambedkar Technological University
(India): PhD candidate, Linguistics, Department of English. Thiyazen
is awarded copies of Standard Arabic Morphological Analyzer (SAMA)
Version 3.1 LDC2010L01 and Arabic Treebank Part 1 v. 4.1 LDC2010T13 for
his work in morphosyntactic analysis of short passives in Standard
Arabic.
Michael Mooney: University of Glasgow (United Kingdom): PhD candidate,
School of Computing Sciences. Michael is awarded copies of Treebank-2
LDC95T7 and BLLIP 1987-89 WSJ Corpus Release 1 LDC2000T43 for their
work in eye-tracking for text-centered modeling.
Abraham Sanders: Rensselaer Polytechnic Institute (USA): PhD
candidate, Cognitive Science. Abraham is awarded a copy of
Switchboard-1 Release 2 LDC97S62 for his work in spoken dialogue
systems.
_______________________________________
New publications:
KAIROS Phase 2 Quizlet was developed by LDC and contains English and
Spanish text, video and image data, and annotations used for
pre-evaluation research and system development during Phase 2 of the
DARPA KAIROS program. KAIROS Quizlets were a series of narrowly
defined tasks designed to explore specific evaluation objectives,
enabling KAIROS system developers to exercise individual system
components on a small data set prior to the full program evaluation.
This corpus contains the complete set of Quizlet data used in Phase 2,
which focused on five real-world complex events within the Disease
Outbreak scenario.
Source data was collected from the web; 66 root web pages were
collected and processed, yielding 65 text data files, 890 image files
and 10 video files. Annotation steps included labeling
scenario-relevant events and relations for each document to develop a
structured representation of temporally ordered events, relations and
arguments; generating a reference knowledge graph; and linking labeled
entries to a knowledge base derived from a Wikidata-based ontology.
The DARPA KAIROS (Knowledge-directed Artificial Intelligence Reasoning
Over Schemas) program aimed to build technology capable of
understanding and reasoning about complex real-world events in order
to provide actionable insights to end users. KAIROS systems utilized
formal event representations in the form of schema libraries that
specified the steps, preconditions and constraints for an open set of
complex events; schemas were then used in combination with event
extraction to characterize and make predictions about real-world
events in a large multilingual, multimedia corpus.
2025 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
*
BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Audio was developed by
LDC and consists of 116 hours of speech from 274 unscripted telephone
conversations between native speakers of the Arabic dialect spoken in
Egypt. The calls were collected by LDC in the CALLFRIEND and CALLHOME
series, in which participants called family members or close friends and
spoke on topics of their choice. Around 33% of the recordings (92
calls) are publicly released for the first time. The remaining 182
recordings were previously published by LDC in various CALLFRIEND,
CALLHOME, and HUB5 Arabic datasets.
The DARPA BOLT (Broad Operational Language Translation) program
developed machine translation and information retrieval for less
formal genres, focusing particularly on user-generated content. LDC
supported the BOLT program by collecting informal data sources --
discussion forums, conversational telephone speech, text messaging,
and chat -- in Chinese, Egyptian Arabic, and English. The material in
this release represents the unannotated Egyptian Arabic source
conversational telephone speech. The telephone data was transcribed,
translated, and annotated for various tasks in the BOLT program,
including word alignment, treebanking, and co-reference.
2025 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
*
BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Transcripts and
Translations contains transcripts and corresponding English
translations for the conversational telephone speech in BOLT CTS
CALLFRIEND CALLHOME Egyptian Arabic Audio and was developed by LDC to
support the DARPA BOLT program.
Transcribers were required to produce a verbatim transcript of all
speech within a file using the CODA orthographic approach; diacritics
were not included. Some transcripts contain redactions for potential
personally identifying information. All speech data was transcribed
and is divided into training, development, and evaluation partitions.
The goal of the BOLT translation task was to translate the Arabic
transcripts into fluent English while preserving the meaning present
in the original Arabic text. Transcripts in the development and
evaluation partitions received first-pass and gold-standard
translations. 99% of the transcripts were translated into English.
2025 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104
Linguistic Field(s): Computational Linguistics