37.665, FYI: February 2026 Newsletter - LDC

Tue Feb 17 15:05:02 UTC 2026

LINGUIST List: Vol-37-665. Tue Feb 17 2026. ISSN: 1069 - 4875.

Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Valeriia Vyshnevetska
Team: Helen Aristar-Dry, Mara Baccaro, Daniel Swanson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Editor for this issue: Daniel Swanson <daniel at linguistlist.org>

================================================================

Date: 16-Feb-2026
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: February 2026 Newsletter - LDC

In this newsletter:
LDC membership discounts expire March 2
Spring 2026 data scholarship recipient
New publications:
2022 NIST Language Recognition Evaluation Test and Development Sets
KAIROS Schema Learning Background Source Data
LORELEI Russian Representative Language Pack
________________________________________
LDC membership discounts expire March 2
Time is running out to save on 2026 membership fees. Renew your LDC
membership, rejoin the Consortium, or become a new member by March 2
to receive a 10% discount. For more information on membership benefits
and options, visit Join LDC.
Spring 2026 data scholarship recipient
Congratulations to the recipient of LDC’s Spring 2026 data
scholarship:
Doma Akshitha Reddy: Chaitanya Bharathi Institute of Technology
(India): Bachelor of Engineering, Information Technology. Doma is
awarded copies of TIMIT Acoustic-Phonetic Continuous Speech Corpus and
The CMU Kids Corpus for their work in child speech.
Since 2010, LDC has awarded scholarships to successful student
applicants twice each year. To date, more than 242 corpora have been
distributed to 162 students across 38 countries. We proudly celebrate
their achievements and the contributions their research has made to
the broader community.
The next round of applications will be accepted in September 2026. For
information about the program, visit the Data Scholarships page.
________________________________________
New publications:
2022 NIST Language Recognition Evaluation Test and Development Sets
was developed by LDC and NIST and contains the test and development
data, metadata, answer keys, and documentation for the 2022 NIST
Language Recognition Evaluation (LRE22). The source data consists
of 222 hours of conversational telephone speech (CTS) and broadcast
narrowband speech (BNBS) in 14 languages: Afrikaans, Tunisian Arabic,
Algerian Arabic, Libyan Arabic, South African English, Indian-accented
South African English, North African French, Ndebele, Oromo, Tigrinya,
Tsonga, Venda, Xhosa, and Zulu.
For the CTS collections, a small number of native speakers made single
calls to multiple individuals in their social network. Calls lasted
8-15 minutes; speakers were free to discuss any topic. The BNBS data
was collected from streaming radio programming, focused on broadcasts
that included narrowband speech (e.g., call-ins to a talk show).
Portions of the CTS callee call sides and portions of each broadcast
recording were manually audited by native speakers to verify language
and quality.
LRE22 emphasized language recognition for African languages, including
low resource languages, and expanded the range of test segment
durations. Further information about the 2022 evaluation can be found
in the 2022 NIST Language Recognition Evaluation Plan.
2026 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
*
KAIROS Schema Learning Background Source Data was developed by LDC and
includes 14,000 English and Spanish documents representing text,
audio, video, image, and multimedia resources collected during the
DARPA KAIROS program as supplemental background source data for the
KAIROS Schema Learning Corpus (SLC). The purpose of the supplemental
collection was to increase the amount of English and Spanish data with
multimedia components for schema learning and to add domains not well
represented in existing Spanish data. The supplemental data in this
release includes material from the business and logistics domains,
instructional documents, and multimedia news.
The complete set of SLC background source data (including the data in
this publication) totaled 16.2 million English, Russian, and Spanish
documents and more than 125,000 audio, video, image, or multimedia
resources. A large portion of that data was drawn from pre-existing
LDC datasets.
The SLC and KAIROS Schema Learning Complex Event Annotation
(LDC2025T07), containing English and Spanish text, audio, video, and
image material labeled for 93 real-world complex events, constitute
the data used by KAIROS system developers for schema learning.
KAIROS systems utilized formal event representations in the form of
schema libraries that specified the steps, preconditions, and
constraints for an open set of complex events; schemas were then used
in combination with event extraction to characterize and make
predictions about real-world events in a large multilingual,
multimedia corpus.
2026 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
*
LORELEI Russian Representative Language Pack contains over 1.26
billion words of Russian monolingual text, 360,00 words of which were
translated into English, 3 million words of found Russian-English
parallel text, and 87,000 Russian words translated from English data.
Approximately 83,000 words were annotated for simple named entities,
around 26,000 words were annotated for full entity (including nominals
and pronouns), entity linking and situation frames (identifying
entities, needs, and issues), and nearly 9,000 words were covered by
noun phrase chunking annotation. Data was collected from discussion
forums, news, reference, social network, and weblog sources.
The LORELEI (Low Resource Languages for Emergent Incidents) program
was concerned with building human language technology for low resource
languages in the context of emergent situations. Representative
languages were selected to provide broad typological coverage.
The knowledge base for entity linking annotation is available
separately as LORELEI Entity Detection and Linking Knowledge Base
(LDC2020T10).
2026 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
________________________________________
To unsubscribe from this newsletter, log in to your LDC account and
uncheck the box next to “Receive Newsletter” under Account Options or
contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

Linguistic Field(s): Computational Linguistics
