34.289, FYI: January 2023 Newsletter - LDC

The LINGUIST List linguist at listserv.linguistlist.org
Thu Jan 26 02:05:06 UTC 2023


LINGUIST List: Vol-34-289. Thu Jan 26 2023. ISSN: 1069 - 4875.

Subject: 34.289, FYI: January 2023 Newsletter - LDC

Moderator: Malgorzata E. Cavar, Francis Tyers (linguist at linguistlist.org)
Managing Editor: Lauren Perkins
Team: Helen Aristar-Dry, Steven Franks, Everett Green, Sarah Robinson, Joshua Sims, Jeremy Coburn, Daniel Swanson, Matthew Fort, Maria Lucero Guillen Puon, Billy Dickson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================


Date: 
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: January 2023 Newsletter - LDC


In this newsletter:
Renew your LDC membership today

New publications:
AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts
LORELEI Swahili Representative Language Pack
________________________________________
Renew your LDC membership today
The importance of curated resources for language-related education,
research, and technology development drives LDC’s mission to create
them, to accept data contributions from researchers across the globe,
and to broadly share such resources through the LDC Catalog. LDC
members enjoy no-cost access to new corpora released annually, as well
as the ability to license legacy data sets from among our 925+
holdings at reduced fees. Ensure that your data needs continue to be
met by renewing your LDC membership or by joining the Consortium
today.

Now through March 1, 2023, organizations that were members in 2022
receive a 10% discount on 2023 membership, and new or returning
organizations receive a 5% discount. Membership remains the most
economical way to access current and past LDC releases. Consult the
Join LDC page for more details on membership options and benefits.
________________________________________
AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts
was developed by LDC and is comprised of approximately 156 hours of
Ukrainian conversational telephone speech and broadcast news audio
with 1.2 million words of corresponding orthographic transcripts.

The news audio data was taken from 87 recordings broadcast by various
Ukrainian sources. The telephone speech was generated from telephone
calls by native Ukrainian speakers to acquaintances in their social
network. Native Ukrainian speakers manually segmented the data into
sentence-level units as part of the transcription process.

The broadcast recordings and transcripts were produced by LDC to
support the DARPA AIDA (Active Interpretation of Disparate
Alternatives) program which aimed to develop a multi-hypothesis
semantic engine to generate explicit alternative interpretations of
events, situations, and trends from a variety of unstructured sources.
The telephone speech audio recordings were collected by LDC to support
the NIST 2011 Language Recognition Evaluation and are also contained
in Multi-Language Conversational Telephone Speech 2011 – Slavic Group
(LDC2016S11).

2023 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.

*

LORELEI Swahili Representative Language Pack was developed by LDC and
is comprised of approximately 4.3 million words of Swahili monolingual
text, 90,000 Swahili words translated from English data, and 545,000
words of found Swahili-English parallel text. Approximately 100,000
words were annotated for named entities and up to 26,000 words were
annotated for entity discovery and linking and situation frames
(identifying entities, needs and issues). Data was collected from
discussion forums, news, reference sources, social networks, and
weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program
was concerned with building human language technology for low resource
languages in the context of emergent situations. Representative
languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available
separately as LORELEI Entity Detection and Linking Knowledge Base
(LDC2020T10).

2023 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC account and
uncheck the box next to “Receive Newsletter” under Account Options or
contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104

Linguistic Field(s): Computational Linguistics




------------------------------------------------------------------------------


LINGUIST List is supported by the following publishers:

Bloomsbury Publishing (formerly The Continuum International Publishing Group) http://www.bloomsbury.com/uk/

Cascadilla Press http://www.cascadilla.com/

Georgetown University Press http://www.press.georgetown.edu

John Benjamins http://www.benjamins.com/

Lincom GmbH https://lincom-shop.eu/

Multilingual Matters http://www.multilingual-matters.com/

Springer Nature http://www.springer.com

Wiley http://www.wiley.com



