37.1458, FYI: April 2026 Newsletter - LDC
The LINGUIST List
linguist at listserv.linguistlist.org
Wed Apr 15 17:05:02 UTC 2026
LINGUIST List: Vol-37-1458. Wed Apr 15 2026. ISSN: 1069 - 4875.
Subject: 37.1458, FYI: April 2026 Newsletter - LDC
Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Valeriia Vyshnevetska
Team: Helen Aristar-Dry, Mara Baccaro, Daniel Swanson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org
Homepage: http://linguistlist.org
Editor for this issue: Daniel Swanson <daniel at linguistlist.org>
================================================================
Date: 15-Apr-2026
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: April 2026 Newsletter - LDC
In this newsletter:
New publications:
DEFT Chinese and English Light and Rich ERE Parallel Annotation
MATERIAL Tagalog-English Language Pack
LORELEI Somali Representative Language Pack
________________________________________
New publications:
DEFT Chinese and English Light and Rich ERE Parallel Annotation was
developed by LDC and consists of 179 Chinese discussion forum
documents and their English translations annotated for entities,
relations, and events (ERE). Light ERE annotation labels entity
mentions, along with the relations and events between and among those
entities, for a target set of entity, relation, and event types, and
includes coreference. Rich ERE
annotation expands types and tagging in the entities, relations, and
events annotation tasks and replaces strict event coreference with a
more loosely defined event hopper annotation. All 179 Chinese-English
document pairs were annotated following Light ERE annotation
guidelines; a subset of 171 Chinese-English document pairs was also
labeled with Rich ERE annotation. The source data and English
translations were drawn from BOLT Chinese Discussion Forum Parallel
Training Data (LDC2017T05), originally collected and translated by LDC
under the DARPA BOLT program.
DARPA's Deep Exploration and Filtering of Text (DEFT) program aimed to
address remaining capability gaps in state-of-the-art natural language
processing technologies related to inference, causal relationships,
and anomaly detection. LDC supported the DEFT program by collecting,
creating, and annotating a variety of data sources.
2026 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
*
MATERIAL Tagalog-English Language Pack was developed by Appen for the
IARPA MATERIAL program and contains 100 hours of Tagalog
conversational telephone speech, transcripts, English translations,
annotations, and queries. Calls were made using different telephones
(e.g., mobile, landline) from a variety of environments. Transcripts
cover approximately 30% of the speech files, 2% of which were
translated into English. This release also includes domain
annotations, English queries, and their relevance annotations.
The MATERIAL program focused on underserved languages with the
ultimate goal of building cross-language information retrieval systems
that find speech and text content using English search queries.
2026 members can access this corpus through their LDC accounts
provided they have submitted a completed copy of the special license
agreement. Non-members may license this data for a fee.
*
LORELEI Somali Representative Language Pack contains over 13 million
words of Somali monolingual text, 800,000 words of which were
translated into English, and 106,000 Somali words translated from
English data. Approximately 73,000 words were annotated for simple
named entities, around 23,000 words for full entity annotation
(including nominals and pronouns), and over 10,000 words for noun
phrase chunking. Data was collected from discussion forums, news,
reference material, social networks, and weblogs.
The LORELEI (Low Resource Languages for Emergent Incidents) program
was concerned with building human language technology for low-resource
languages in the context of emergent situations. Representative
languages were selected to provide broad typological coverage.
The knowledge base for entity linking annotation is available
separately as LORELEI Entity Detection and Linking Knowledge Base
(LDC2020T10).
2026 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account and
uncheck the box next to “Receive Newsletter” under Account Options or
contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104
Linguistic Field(s): Computational Linguistics
------------------------------------------------------------------------------
********************** LINGUIST List Support ***********************
Please consider donating to the Linguist List, a U.S. 501(c)(3) not for profit organization:
https://www.paypal.com/donate/?hosted_button_id=87C2AXTVC4PP8
LINGUIST List is supported by the following publishers:
Bloomsbury Publishing http://www.bloomsbury.com/uk/
Cambridge University Press http://www.cambridge.org/linguistics
Cascadilla Press http://www.cascadilla.com/
De Gruyter Brill https://www.degruyterbrill.com/?changeLang=en
Edinburgh University Press http://www.edinburghuniversitypress.com
European Language Resources Association (ELRA) http://www.elra.info
John Benjamins http://www.benjamins.com/
Language Science Press http://langsci-press.org
Lincom GmbH https://lincom-shop.eu/
MDPI Languages https://www.mdpi.com/journal/languages
MIT Press http://mitpress.mit.edu/
Multilingual Matters http://www.multilingual-matters.com/
Narr Francke Attempto Verlag GmbH + Co. KG http://www.narr.de/
Netherlands Graduate School of Linguistics / Landelijke (LOT) http://www.lotpublications.nl/
Peter Lang AG http://www.peterlang.com
SIL International Publications http://www.sil.org/resources/publications
----------------------------------------------------------
LINGUIST List: Vol-37-1458
----------------------------------------------------------