29.5053, FYI: December 2018 Newsletter - LDC

Wed Dec 19 05:33:35 UTC 2018

LINGUIST List: Vol-29-5053. Wed Dec 19 2018. ISSN: 1069 - 4875.

Subject: 29.5053, FYI: December 2018 Newsletter - LDC

Moderator: linguist at linguistlist.org (Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté)
Homepage: https://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================

Date: Wed, 19 Dec 2018 00:33:15
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: December 2018 Newsletter - LDC

In this newsletter:

LDC Membership Discounts for MY2019 Still Available

Spring 2019 LDC Data Scholarship Program - deadline approaching

New publications:

HUB5 Mandarin Telephone Speech and Transcripts Second Edition
Nautilus Speaker Characterization
TAC Relation Extraction Dataset 

LDC Membership Discounts for MY2019 Still Available
Join LDC while membership savings are still available. Now through March 1,
2019, renewing MY2018 members will receive a 10% discount off the membership
fee. New or non-consecutive member organizations will receive a 5% discount.
Membership remains the most economical way to access LDC releases. Visit the
Join LDC page for details on membership options and benefits.

Spring 2019 LDC Data Scholarship Program - deadline approaching
Students can apply for the Spring 2019 Data Scholarship Program now through
January 15, 2019. The LDC Data Scholarship program provides students with
access to LDC data at no cost. For more information on application
requirements and program rules, please visit the LDC Data Scholarships page.

New publications:

(1) HUB5 Mandarin Telephone Speech and Transcripts Second Edition was
developed by LDC in support of US government projects for language recognition
and Large Vocabulary Conversational Speech Recognition (LVCSR). The first
edition was released by LDC in two data sets, HUB5 Mandarin Telephone Speech
Corpus (LDC98S69) and HUB5 Mandarin Transcripts (LDC98T26). This second
edition merges the speech and transcript releases, updates the audio format,
and adds Pinyin transcripts, forced alignment, and updated documentation and
metadata.

This corpus contains (1) approximately 19 hours of Mandarin speech from 42
unscripted telephone conversations between native speakers of Mandarin from
CALLFRIEND Mandarin Chinese-Mainland Dialect (LDC96S55), which has also been
released in a second, updated edition (LDC2018S09), and (2) associated
transcripts of contiguous 5-30 minute segments from those telephone
conversations.

Participants could speak with a person of their choice on any topic; most
called family members and friends. The recorded conversations lasted up to 30
minutes. Native Mandarin speakers created the transcripts manually in the
GB2312 encoding. This release includes Pinyin transcripts and the
original transcripts, both in UTF-8.
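
For readers unfamiliar with the encoding change mentioned above, the following
is a minimal sketch of re-encoding GB2312 text as UTF-8 using Python's standard
codecs. It is an illustration only, not the procedure LDC used; the function
name and the sample string are hypothetical.

```python
def gb2312_to_utf8(raw: bytes) -> bytes:
    """Decode GB2312-encoded bytes and re-encode the text as UTF-8.

    Hypothetical helper for illustration; raises UnicodeDecodeError
    if the input is not valid GB2312.
    """
    text = raw.decode("gb2312")
    return text.encode("utf-8")


# Hypothetical sample: a two-character greeting stored as GB2312 bytes.
sample_gb2312 = "你好".encode("gb2312")
sample_utf8 = gb2312_to_utf8(sample_gb2312)

# The text is unchanged; only the byte representation differs.
print(sample_utf8.decode("utf-8"))
```

GB2312 stores each Chinese character in two bytes, while UTF-8 uses three bytes
for the same characters, so converted files are larger even though the text is
identical.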

HUB5 Mandarin Telephone Speech and Transcripts Second Edition is available via
web download.

2018 Subscription Members will automatically receive copies of this corpus.
2018 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

(2) Nautilus Speaker Characterization was developed at the Technical
University of Berlin and comprises approximately 155 hours of
conversational speech from 300 German speakers aged 18 to 35 years (126 males
and 174 females) with no marked dialect or accent, recorded in an
acoustically-isolated room. The corpus was designed to support research on the
detection of speaker social characteristics, such as personality, charisma,
and voice attractiveness.

Four scripted and four semi-spontaneous dialogs simulating telephone call
inquiries were elicited from the speakers. Additionally, spontaneous neutral
and emotional speech utterances (predominantly excitement or frustration) and
questions were produced.

Speech corresponding to one of the semi-spontaneous dialogs was evaluated with
respect to 34 continuous numeric labels of perceived interpersonal speaker
characteristics (such as likable, attractive, competent, childish). For a set
of 20 selected "extreme" speakers evaluated for their warmth-attractiveness,
34 naive voice descriptions (such as bright, creaky, articulate, melodious)
were also evaluated. The corpus contains all labels, together with the speech
recordings and the speakers' metadata (e.g., age, gender, place of birth,
chronological places of residence and duration of stay, parents' place of
birth, self-assessed personality).

Nautilus Speaker Characterization is available via web download. 

2018 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2018
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data at no cost.

(3) TAC Relation Extraction Dataset (TACRED) was developed by The Stanford NLP
Group and is a large-scale relation extraction dataset with 106,264 examples
built over English newswire and web text used in the NIST TAC KBP English slot
filling evaluations during the period 2009-2014. The annotations were derived
from TAC KBP relation types (see the guidelines), from human annotations
developed by LDC and from crowdsourcing using Mechanical Turk.

Source corpora used for this dataset were TAC KBP Comprehensive English Source
Corpora 2009-2014 (LDC2018T03) and TAC KBP English Regular Slot Filling -
Comprehensive Training and Evaluation Data 2009-2014 (LDC2018T22). For
detailed information about the dataset and benchmark results, please refer to
the TACRED paper.

TAC Relation Extraction Dataset is available via web download.

2018 Subscription Members will automatically receive copies of this corpus.
2018 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104

Linguistic Field(s): Computational Linguistics
