31.1667, FYI: April 2020 Newsletter - LDC

Mon May 18 18:08:01 UTC 2020

LINGUIST List: Vol-31-1667. Mon May 18 2020. ISSN: 1069 - 4875.

Subject: 31.1667, FYI: April 2020 Newsletter - LDC

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Jeremy Coburn
Managing Editor: Becca Morris
Team: Helen Aristar-Dry, Everett Green, Sarah Robinson, Lauren Perkins, Nils Hjortnaes, Yiwen Zhang, Joshua Sims
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Sarah Robinson <srobinson at linguistlist.org>
================================================================

Date: Mon, 18 May 2020 14:07:44
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: April 2020 Newsletter - LDC

In this newsletter: 

New Publications:
LORELEI Oromo Incident Language Pack
LORELEI Entity Detection and Linking Knowledge Base
BOLT English Translation Treebank - Chinese Discussion Forum
Multi-Language Conversational Telephone Speech 2011 -- Mandarin Chinese

New publications:

- LORELEI Oromo Incident Language Pack was developed by LDC and comprises
approximately 3.9 million words of Oromo monolingual text, 25,000 words of
English monolingual text, 135,000 words of parallel and comparable
Oromo-English text, and 50,000 words of data annotated for Entity Discovery
and Linking and for Situation Frames. It contains all of the text data,
annotations, supplemental resources, and related software tools for the Oromo
language that were used in the DARPA LORELEI / LoReHLT 2017 Evaluation. 

The knowledge base for the entity linking annotation in this corpus is
available separately as LORELEI Entity Detection and Linking Knowledge Base
(LDC2020T10). 

LORELEI Oromo Incident Language Pack is distributed via web download. 

2020 Subscription Members will automatically receive copies of this corpus.
2020 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee. 

- LORELEI Entity Detection and Linking Knowledge Base was developed by LDC and
contains the full LORELEI Entity Detection and Linking (EDL) Knowledge Base
(KB) used for all LORELEI Representative Language and Incident Language Pack
entity linking annotation. The LORELEI (Low Resource Languages for Emergent
Incidents) Program was concerned with building human language technology for
low resource languages in the context of emergent situations like natural
disasters or disease outbreaks.

LORELEI Entity Detection and Linking Knowledge Base is distributed via web
download. 

2020 Subscription Members will automatically receive copies of this corpus.
2020 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

- BOLT English Translation Treebank - Chinese Discussion Forum was developed
by LDC and consists of 147,432 tokens of web discussion forum data translated
from Chinese to English and annotated for part-of-speech and syntactic
structure.

BOLT English Translation Treebank - Chinese Discussion Forum is distributed
via web download. 

2020 Subscription Members will automatically receive copies of this corpus.
2020 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee. 

- Multi-Language Conversational Telephone Speech 2011 -- Mandarin Chinese was
developed by LDC and comprises approximately 25 hours of telephone speech in
Mandarin Chinese.

The data were collected primarily to support research and technology
evaluation in automatic language identification, and portions of these
telephone calls were used in the NIST 2011 Language Recognition Evaluation
(LRE). Participants were recruited by native speakers who contacted
acquaintances in their social network. Those native speakers made one call of
up to 15 minutes to each acquaintance. The data were collected using LDC's
telephone collection infrastructure, which comprises three computer telephony
systems. Human auditors labeled calls for callee gender, dialect type, and
noise.

Multi-Language Conversational Telephone Speech 2011 -- Mandarin Chinese is
distributed via web download. 

2020 Subscription Members will automatically receive copies of this corpus.
2020 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104

Linguistic Field(s): Computational Linguistics

------------------------------------------------------------------------------
