36.3512, FYI: November 2025 Newsletter - LDC

Tue Nov 18 18:05:02 UTC 2025

LINGUIST List: Vol-36-3512. Tue Nov 18 2025. ISSN: 1069 - 4875.

Subject: 36.3512, FYI: November 2025 Newsletter - LDC

Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Valeriia Vyshnevetska
Team: Helen Aristar-Dry, Mara Baccaro, Daniel Swanson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Editor for this issue: Daniel Swanson <daniel at linguistlist.org>

================================================================

Date: 17-Nov-2025
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: November 2025 Newsletter - LDC

In this newsletter:
- Join LDC for membership year 2026
- Spring 2026 data scholarship application deadline
- New publications:
  - AnnoDIFP CTS Audio and Transcripts
  - LORELEI Ilocano Incident Language Pack
________________________________________
Join LDC for membership year 2026
It’s time to renew your LDC membership for 2026. Any organization that
joins the Consortium or renews its membership before March 2, 2026,
will receive a 10% discount on the membership fee.
In addition to accessing new publications, current LDC members enjoy
the benefit of licensing older data from our Catalog of close to 1000
holdings at reduced fees. Current-year for-profit members may use
most data for commercial applications.
Plans for next year’s publications are in progress. Among the expected
releases are:
- 2012 NIST Speaker Recognition Evaluation Test Set: 10,000+ hours of
English conversational telephone speech following the Mixer collection
protocol, used in NIST’s 2012 speaker recognition evaluation
- KAIROS schema learning corpus background data and Phase 1 evaluation
datasets: multimodal English and Spanish source data and annotations
for reasoning about complex real-world events
- CALL MY NET 2: 800+ hours of Tunisian Arabic conversational
telephone speech from over 400 speakers to support text-independent
speaker recognition, used in the 2018 NIST Speaker Recognition
Evaluation
- Multi-language conversational telephone speech: multiple releases,
hundreds of hours of speech from speakers of confusable linguistic
varieties (Arabic, Chinese, English, French, Slavic, Spanish) to
support language identification
- CALLHOME omnibus releases: combined speech and transcript datasets
with updated directory structure, file formats and documentation, and
lexicons (Chinese, English, German, Japanese, Spanish)
- IARPA MATERIAL language packs: conversational telephone speech,
transcripts, English translations, annotations and queries in multiple
languages (e.g., Lithuanian, Pashto, Swahili, Tagalog)
For full descriptions of all LDC data sets, browse our Catalog. Visit
the Join LDC page for details on membership, user accounts and payment.
________________________________________
Spring 2026 data scholarship application deadline
Applications are being accepted through January 15, 2026, for the
Spring 2026 LDC data scholarship program, which provides university
students with no-cost access to LDC data. Consult the LDC Data
Scholarships page for more information about program rules and
submission requirements.
________________________________________
New publications:
AnnoDIFP (Annotated Data for the Investigation of Facets of
Personality) CTS (Conversational Telephone Speech) Audio and
Transcripts was developed by LDC, the Florida Institute of Technology
and the University of New Haven to support algorithm development for
predicting personality traits. It contains 242.52 hours of English
telephone audio recordings and transcripts from 1,179 telephone calls
involving 327 participants, paired with scores from two self-reported
personality assessments: the HEXACO Personality Inventory (Revised)
(HEXACO-PI-R) and the Short Dark Triad (SD3).
This corpus contains audio and transcripts for 277 participants and
transcripts only for 50 participants. Telephone calls were collected
using LDC's robot-operator platform. The operator called participants
every 24 hours during their indicated availability and paired them
with another participant to speak on a prompted topic for 10 minutes.
Transcripts were produced automatically using the Rev.ai
speech-to-text service.
2025 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
*
LORELEI Ilocano Incident Language Pack was developed by LDC and
comprises 8.9 million words of Ilocano monolingual text, 3.3
million words of English monolingual text, 3.2 million words of
parallel Ilocano-English text, and 3 million words annotated for
entity discovery and linking and situation frames. It constitutes all
of the text data, annotations, supplemental resources, and related
software tools for the Ilocano language used in the DARPA LORELEI /
LoReHLT 2019 Evaluation.
The LORELEI (Low Resource Languages for Emergent Incidents) program
was concerned with building human language technology for low-resource
languages in the context of emergent situations. In the evaluation
scenario, an unforeseen event triggered a need for humanitarian and
logistical support in a region where the incident language had
received little or no attention in NLP research. Evaluation
participants provided NLP solutions, including information extraction
and machine translation, with limited resources and limited
development time.
Data was collected from news, social networks, weblogs, newsgroups,
discussion forums, and reference material. Entity discovery and linking
annotation identified entities to be detected by systems for scoring
purposes. Situation frame analysis was designed to extract basic
information about needs and relevant issues for planning a disaster
response effort.
2025 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

Linguistic Field(s): Computational Linguistics
