36.2772, FYI: September 2025 Newsletter - LDC
The LINGUIST List
linguist at listserv.linguistlist.org
Tue Sep 16 13:05:02 UTC 2025
LINGUIST List: Vol-36-2772. Tue Sep 16 2025. ISSN: 1069 - 4875.
Subject: 36.2772, FYI: September 2025 Newsletter - LDC
Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Valeriia Vyshnevetska
Team: Helen Aristar-Dry, Mara Baccaro, Daniel Swanson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org
Homepage: http://linguistlist.org
Editor for this issue: Daniel Swanson <daniel at linguistlist.org>
================================================================
Date: 15-Sep-2025
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: September 2025 Newsletter - LDC
In this newsletter:
LDC data and commercial technology development
New publications:
Mixer 7 English Speech
AIDA Scenario 1 Evaluation Topic Source Data, Annotation and
Assessment
LORELEI Hindi Representative Language Pack
________________________________________
LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a
prerequisite for obtaining a commercial license to almost all LDC
databases. Non-member organizations, including non-member for-profit
organizations, cannot use LDC data to develop or test products for
commercialization, nor can they use LDC data in any commercial product
or for any commercial purpose. LDC data users should consult
corpus-specific license agreements for limitations on the use of
certain corpora. Visit the Licensing page for further information.
________________________________________
New publications:
Mixer 7 English Speech was developed by LDC and contains 12,321 hours
of audio recordings of interviews, transcript readings, and
conversational telephone speech involving 222 distinct English
speakers. This material was collected by LDC in 2010-2011 as part of
the Mixer project, and the recordings were used in the 2012 NIST SRE
test set.
Recruited speakers were connected through a robot operator to carry on
casual conversations on a pre-set topic lasting up to 10 minutes.
Participants also visited LDC’s Human Subjects Collection Lab, equipped
with a 14-microphone array, where they participated in interviews and
transcript readings and conducted telephone calls under varying
conditions. Selected speaker metadata was also collected.
2025 members can access this corpus through their LDC accounts. This
corpus is a Members-Only release and is not available for non-member
licensing. Contact ldc at ldc.upenn.edu for information about membership.
*
AIDA Scenario 1 Evaluation Topic Source Data, Annotation and
Assessment was developed by LDC and comprises English, Russian,
and Ukrainian web documents (text, video, image), annotations, and
assessments used in the AIDA Phase 1 pilot and final evaluations. The
Phase 1 scenario focused on political relations between Russia and
Ukraine in the 2010s. The material in this corpus covers the following
events: Suspicious Deaths and Murders in Ukraine (January-April 2015);
Odessa Tragedy (May 2, 2014); and Siege of Sloviansk and Battle of
Kramatorsk (April-July 2014).
The corpus contains 10,522 documents, annotations for 386 of those
documents, and assessment results covering 77,965 responses in 1,525
of those documents. Annotations were performed in three steps: (1)
within-document labels for scenario-related entities, relations, and
events; (2) coreference annotation across documents by linking
information elements to a knowledge base; and (3) indications of any
relationship between labeled events/relations and hypotheses about the
scenario. In the assessment phase, LDC annotators reviewed and judged
system response files to provide evaluation organizers with a means
for scoring submissions. Assessment tasks included zero-hop
assessment, class-based assessment, graph assessment, and hypothesis
assessment.
The DARPA AIDA (Active Interpretation of Disparate Alternatives)
program aimed to develop a multi-hypothesis semantic engine to
generate explicit alternative interpretations of events, situations,
and trends from a variety of unstructured sources. LDC supported AIDA
by collecting, creating, and annotating multimodal linguistic
resources in multiple languages.
2025 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
*
LORELEI Hindi Representative Language Pack contains over 26 million
words of Hindi monolingual text, 363,000 words of which were translated
into English, 1.07 million words of found Hindi-English parallel text,
and 118,000 Hindi words translated from English data. Approximately
103,000 words were annotated for simple named entities and over 25,000
words were annotated for full entity (including nominals and
pronouns), entity linking, and situation frames (identifying entities,
needs and issues). Data was collected from discussion forums, news,
reference, social network, and weblog sources.
The LORELEI (Low Resource Languages for Emergent Incidents) program
was concerned with building human language technology for low resource
languages in the context of emergent situations. Representative
languages were selected to provide broad typological coverage.
The knowledge base for entity linking annotation is available
separately as LORELEI Entity Detection and Linking Knowledge Base
(LDC2020T10).
2025 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account and
uncheck the box next to “Receive Newsletter” under Account Options or
contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104
Linguistic Field(s): Computational Linguistics
------------------------------------------------------------------------------
********************** LINGUIST List Support ***********************
Please consider donating to the Linguist List, a U.S. 501(c)(3) not for profit organization:
https://www.paypal.com/donate/?hosted_button_id=87C2AXTVC4PP8
LINGUIST List is supported by the following publishers:
Bloomsbury Publishing http://www.bloomsbury.com/uk/
Cambridge University Press http://www.cambridge.org/linguistics
Cascadilla Press http://www.cascadilla.com/
De Gruyter Brill https://www.degruyterbrill.com/?changeLang=en
Edinburgh University Press http://www.edinburghuniversitypress.com
John Benjamins http://www.benjamins.com/
Language Science Press http://langsci-press.org
MIT Press http://mitpress.mit.edu/
Multilingual Matters http://www.multilingual-matters.com/
Narr Francke Attempto Verlag GmbH + Co. KG http://www.narr.de/
Netherlands Graduate School of Linguistics / Landelijke (LOT) http://www.lotpublications.nl/
Peter Lang AG http://www.peterlang.com