35.215, FYI: January 2024 Newsletter - LDC

Wed Jan 17 19:05:02 UTC 2024

LINGUIST List: Vol-35-215. Wed Jan 17 2024. ISSN: 1069 - 4875.

Moderators: Malgorzata E. Cavar, Francis Tyers (linguist at linguistlist.org)
Managing Editor: Justin Fuller
Team: Helen Aristar-Dry, Steven Franks, Everett Green, Daniel Swanson, Maria Lucero Guillen Puon, Zackary Leech, Lynzie Coburn, Natasha Singh, Erin Steitz
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Justin Fuller <justin at linguistlist.org>
================================================================

Date: 17-Jan-2024
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: January 2024 Newsletter - LDC

In this newsletter:
Renew your LDC membership today

New publications:
KASET – Kurmanji and Sorani Kurdish Speech and Transcripts
LORELEI Farsi Representative Language Pack
________________________________________
Renew your LDC membership today
The importance of curated resources for language-related education,
research, and technology development drives LDC’s mission to create
them, to accept data contributions from researchers across the globe,
and to broadly share such resources through the LDC Catalog. LDC
members enjoy no-cost access to new corpora released annually, as well
as the ability to license legacy data sets from among our 950+
holdings at reduced fees. Ensure that your data needs continue to be
met by renewing your LDC membership or by joining the Consortium
today.

Now through March 1, 2024, organizations that were members in 2023
receive a 10% discount on 2024 membership, and new or returning
organizations receive a 5% discount. Membership remains the most
economical way to access current and past LDC releases. Consult the
Join LDC page for more details on membership options and benefits.
________________________________________
New publications:
KASET - Kurmanji and Sorani Kurdish Speech and Transcripts consists of
147 hours of telephone conversations (289 recordings) and broadcast
news (410 recordings) in two Kurdish dialects, Kurmanji Kurdish and
Sorani Kurdish, along with transcripts covering 60 hours of those
recordings. Kurdish is spoken primarily in Turkey, Iran, Iraq, and
Syria. Sorani and Kurmanji are the two most widely spoken dialects of
the Kurdish language.

The telephone speech was generated from calls by native Kurdish
speakers in the United States to North American acquaintances in their
social network. The broadcast news audio was collected from multiple
streaming radio and television broadcast programs (narrowband and
wideband audio), many of which contained a mix of Kurmanji and Sorani
Kurdish. Native speaker auditors identified a 5-10 minute span from
each broadcast recording for transcription. Full telephone recordings
that passed the native speaker audit were transcribed. This release
includes speaker information, such as gender, year of birth, and
language.

2024 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.

*

LORELEI Farsi Representative Language Pack was developed by LDC and
comprises approximately 250 million words of Farsi monolingual
text, 120,000 Farsi words translated from English data, and 751,000
words of found Farsi-English parallel text. Approximately 75,000 words
were annotated for named entities and up to 22,000 words were
annotated for entity discovery and linking and situation frames
(identifying entities, needs, and issues). Data was collected from
discussion forums, news, reference material, social networks, and
weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program
was concerned with building human language technology for low resource
languages in the context of emergent situations. Representative
languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available
separately as LORELEI Entity Detection and Linking Knowledge Base
(LDC2020T10).

2024 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC account and
uncheck the box next to “Receive Newsletter” under Account Options or
contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

Linguistic Field(s): Computational Linguistics

------------------------------------------------------------------------------

Please consider donating to the Linguist List https://give.myiu.org/iu-bloomington/I320011968.html

LINGUIST List is supported by the following publishers:

Cambridge University Press http://www.cambridge.org/linguistics

Multilingual Matters http://www.multilingual-matters.com/

Wiley http://www.wiley.com
