36.1885, FYI: June 2025 Newsletter - LDC
The LINGUIST List
linguist at listserv.linguistlist.org
Wed Jun 18 01:05:02 UTC 2025
LINGUIST List: Vol-36-1885. Wed Jun 18 2025. ISSN: 1069-4875.
Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Justin Fuller
Team: Helen Aristar-Dry, Steven Franks, Joel Jenkins, Daniel Swanson, Erin Steitz
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org
Homepage: http://linguistlist.org
Editor for this issue: Joel Jenkins <joel at linguistlist.org>
================================================================
Date: 16-Jun-2025
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: June 2025 Newsletter - LDC
In this newsletter:
LDC data and commercial technology development
New publications:
Chinese Sentence Pattern Structure Treebank
IWSLT 2022-2023 Shared Task Training, Development and Test Set
KAIROS Schema Learning Complex Event Annotation
________________________________________
LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a
prerequisite for obtaining a commercial license to almost all LDC
databases. Non-member organizations, including non-member for-profit
organizations, cannot use LDC data to develop or test products for
commercialization, nor can they use LDC data in any commercial product
or for any commercial purpose. LDC data users should consult
corpus-specific license agreements for limitations on the use of
certain corpora. Visit the Licensing page for further information.
________________________________________
New publications:
Chinese Sentence Pattern Structure Treebank was developed at Beijing
Normal University and Peking University. It contains 5,016 sentences
and 119,627 tokens syntactically annotated following the concept of
sentence constituent analysis, which emphasizes sentence pattern
structure. The source data consists of 27 chapters extracted from
modern Mandarin and ancient Chinese works. There are three annotation
layers: lexical sense and structural mode for dynamic words; syntactic
structure for clauses; and inter-clause relations within complex
sentences and sentence clusters. These structures can be visualized
using the Jbw-viewer tool, which is included in the release.
2025 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
*
IWSLT 2022-2023 Shared Task Training, Development and Test Set was
developed by LDC and contains 210 hours of Tunisian Arabic
conversational telephone speech, transcripts, English translations,
speaker metadata, and documentation. This material constitutes the
training, development, and test data used in the International
Conference on Spoken Language Translation (IWSLT) Dialectal Speech
Translation task (2022) and the Dialectal and Low-resource track
(2023).
The telephone speech was collected by LDC in 2016-2017 from native
speakers of Tunisian Arabic in Tunis. Speakers were recruited to make
telephone calls to people in their social networks from a variety of
noise conditions and handsets. Transcripts are orthographic, following
Buckwalter transliteration, and cover 175 hours of the collected
speech. IPA transcripts were added to a subset of the data. All
transcribed segments were translated into English.
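
For readers unfamiliar with Buckwalter transliteration, the following is a
minimal sketch of how Arabic letters map to ASCII symbols under the standard
Buckwalter scheme. It is an illustration only: the names BUCKWALTER and
to_buckwalter are hypothetical, diacritics are omitted, and the exact variant
used in the corpus transcripts is an assumption that should be checked against
the corpus documentation.

```python
# Standard Buckwalter symbols for two contiguous Unicode ranges of Arabic
# letters. Assumption: plain Buckwalter; the corpus may use a modified variant.
_FIRST = "'|>&<}AbptvjHxd*rzs$SDTZEg"  # U+0621 (hamza) .. U+063A (ghain)
_SECOND = "fqklmnhwYy"                  # U+0641 (feh)   .. U+064A (yeh)

BUCKWALTER = {chr(0x0621 + i): sym for i, sym in enumerate(_FIRST)}
BUCKWALTER.update({chr(0x0641 + i): sym for i, sym in enumerate(_SECOND)})


def to_buckwalter(text: str) -> str:
    """Replace each mapped Arabic letter with its Buckwalter symbol.

    Characters without an entry in the table (spaces, digits, Latin text,
    diacritics) are passed through unchanged.
    """
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # kaf, teh, alef, beh: the word for "book"
    print(to_buckwalter("\u0643\u062a\u0627\u0628"))
```

Because the mapping is one character to one character, it can be inverted by
swapping the keys and values of the table to recover Arabic script from a
transliterated transcript.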
2025 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
*
KAIROS Schema Learning Complex Event Annotation was developed by LDC
to support the DARPA KAIROS program. It contains English and Spanish
text, audio, video, and image data labeled for 93 real-world complex
events with event, relation, and argument annotations linking to
document provenance. Source data was collected from the web; 3,431 root
web pages were collected and processed, yielding 1,919 text data files,
24,019 image files, 1,472 video files, and 16 audio files.
The DARPA KAIROS (Knowledge-directed Artificial Intelligence Reasoning
Over Schemas) program aimed to build technology capable of
understanding and reasoning about complex real-world events in order
to provide actionable insights to end users. KAIROS systems utilized
formal event representations in the form of schema libraries that
specified the steps, preconditions, and constraints for an open set of
complex events; schemas were then used in combination with event
extraction to characterize and make predictions about real-world
events in a large, multilingual, multimedia corpus.
2025 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account and
uncheck the box next to “Receive Newsletter” under Account Options or
contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104
Linguistic Field(s): Computational Linguistics