36.2169, FYI: July 2025 Newsletter - LDC

Tue Jul 15 21:05:02 UTC 2025

LINGUIST List: Vol-36-2169. Tue Jul 15 2025. ISSN: 1069 - 4875.

Subject: 36.2169, FYI: July 2025 Newsletter - LDC

Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Valeriia Vyshnevetska
Team: Helen Aristar-Dry, Mara Baccaro, Daniel Swanson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Editor for this issue: Daniel Swanson <daniel at linguistlist.org>

================================================================

Date: 15-Jul-2025
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: July 2025 Newsletter - LDC

In this newsletter:
Fall 2025 LDC data scholarship program
New publications:
AnnoDIFP Session Audio and Transcripts
Penn Parsed Corpora of Historical English Second Release
LoReHLT Uzbek Representative Language Pack
________________________________________
Fall 2025 LDC data scholarship program
Student applications for the Fall 2025 LDC data scholarship program
are being accepted now through September 15, 2025. This program
provides eligible students with no-cost access to LDC data. Students
must complete an application consisting of a data use proposal and
letter of support from their advisor. For application requirements and
program rules, visit the LDC Data Scholarships page.
________________________________________
New publications:
AnnoDIFP (Annotated Data for the Investigation of Facets of
Personality) Session Audio and Transcripts was developed by LDC, the
Florida Institute of Technology (FIT), and the University of New Haven
(UNH) to support algorithm development for predicting personality
traits. It contains 438.34 hours of English audio and transcripts from
in-person interviews of 366 participants, paired with scores from two
self-reported personality assessments: the HEXACO Personality
Inventory (Revised) (HEXACO-PI-R) and the Short Dark Triad (SD3).
In-person interviews were recorded at LDC, FIT, and UNH. In each
session, the participant and interviewer were in separate
sound-isolated rooms with communication between them supplied by
audio/video hardware. Sessions consisted of the following tasks:
rapport building, a YouTube task, a map task, and a business task.
Further details on collection methodology and session tasks are
contained in the documentation accompanying this release.
2025 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
*
Penn Parsed Corpora of Historical English Second Release was developed
at the University of Pennsylvania and consists of running texts and
text samples of British English prose from the earliest Middle English
documents (1100 CE) up to the period of the First World War (1914 CE).
This second release corrects errors and inconsistencies in Penn Parsed
Corpora of Historical English (LDC2020T16), further streamlines
annotation, simplifies the directory structure, and includes updated
documentation.
This data set contains three corpora covering traditionally recognized
periods of English:
• The Penn-Helsinki Parsed Corpus of Middle English, second edition
• The Penn-Helsinki Parsed Corpus of Early Modern English
• The Penn Parsed Corpus of Modern British English, second edition
The texts are in two forms: part-of-speech tagged text and
syntactically annotated text. Annotations were manually reviewed for
accuracy and consistency. Included in this release are updated
annotation guidelines, philological information for each corpus, and
the CorpusSearch 2 program, which allows users to search the data for
words, word sequences, and syntactic structure.
2025 members can access this corpus through their LDC accounts
provided they have submitted a completed copy of the special license
agreement. Non-members may license this data for a fee.
*
LoReHLT Uzbek Representative Language Pack was developed by LDC and
comprises approximately 47 million words of Uzbek monolingual text,
563,000 words of found Uzbek-English parallel text, 100,000 Uzbek
words translated from English data, and 6.4 hours of Uzbek broadcast
news and amateur web audio recordings. Approximately 151,000 words
were annotated for named entities and over 28,000 words were annotated
for full entities, including nominals and pronouns. Noun-phrase chunking
was applied to more than 13,000 words. Over 20,890 words were labeled
with simple semantic annotation. Topic annotation was applied to the
audio recordings. Data was collected from discussion forum, news,
reference, social network, broadcast news, web audio recordings, and
weblogs.
LoReHLT was a companion project of the DARPA LORELEI program. The
LORELEI (Low Resource Languages for Emergent Incidents) program was
concerned with building human language technology for low resource
languages in the context of emergent situations. Representative
languages were selected to provide broad typological coverage.
2025 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

Linguistic Field(s): Computational Linguistics
