29.2941, FYI: July 2018 Newsletter - LDC

Wed Jul 18 13:45:56 UTC 2018

LINGUIST List: Vol-29-2941. Wed Jul 18 2018. ISSN: 1069 - 4875.

Subject: 29.2941, FYI: July 2018 Newsletter - LDC

Moderator: linguist at linguistlist.org (Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté)
Homepage: https://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Kenneth Steimel <ken at linguistlist.org>
================================================================

Date: Wed, 18 Jul 2018 09:45:41
From: LDC Membership Office [ldc at ldc.upenn.edu]
Subject: July 2018 Newsletter - LDC

In this newsletter: 
Fall 2018 Data Scholarship Program

New Publications:
CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition
RATS Language Identification
TRAD Chinese-French Parallel Text – Broadcast News

Fall 2018 LDC Data Scholarship Program

Student applications for the Fall 2018 LDC Data Scholarship program are being
accepted now through September 15, 2018. This scholarship program provides
university students with access to LDC data at no cost. Students must complete
an application consisting of a data use proposal and a letter of support from
their advisor.

For application requirements and program rules, please visit the LDC Data
Scholarship page.

New Publications:

(1) CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition was developed
by LDC and consists of approximately 24 hours of unscripted telephone
conversations between native speakers of the Mandarin Chinese dialect spoken
in mainland China. This second edition updates the audio files to wav format,
simplifies the directory structure and adds documentation and metadata. The
first edition is available as CALLFRIEND Mandarin Chinese-Mainland Dialect
(LDC96S55).

All data was collected before July 1997. Participants could speak with a
person of their choice on any topic; most called family members and friends.
All calls originated in North America. The recorded conversations last up to
30 minutes.

CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition is distributed via
web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

(2) RATS Language Identification was developed by LDC and consists of
approximately 5,400 hours of Levantine Arabic, Farsi, Dari, Pashto and Urdu
conversational telephone speech with annotation of speech segments. The corpus
was created to provide training, development and initial test sets for the
Language Identification (LID) task in the DARPA RATS (Robust Automatic
Transcription of Speech) program.

The source audio is drawn from: (1) conversational telephone speech (CTS)
recordings, taken either from previous LDC CTS corpora or from CTS data
collected specifically for the RATS program from native speakers of Levantine
Arabic, Pashto, Urdu, Farsi and Dari; and (2) portions of VOA broadcast news
recordings, taken from data used in the 2009 NIST Language Recognition
Evaluation (LRE). The 2009 LRE Test Set is available from LDC as LDC2014S06.

CTS recordings were audited by annotators who listened to short segments and
determined whether the audio was in the target language. Annotations on the
audio files include start time, end time, speech activity detection (SAD)
label, SAD provenance, language ID and LID provenance.
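
As a rough illustration of how the six annotation fields listed above might be
held in code, here is a minimal Python sketch. The tab-separated layout, the
field order, the use of seconds for times, and the sample values are all
assumptions made for this example; the documentation shipped with the corpus
defines the actual file format.

```python
from dataclasses import dataclass


@dataclass
class SegmentAnnotation:
    """One annotated segment, holding the six fields named in the newsletter."""
    start_time: float      # assumed to be seconds from the start of the file
    end_time: float        # assumed to be seconds from the start of the file
    sad_label: str         # speech activity detection (SAD) label
    sad_provenance: str    # how the SAD label was produced
    language_id: str       # language of the segment
    lid_provenance: str    # how the language ID was produced

    @property
    def duration(self) -> float:
        """Length of the segment, in the same unit as the time fields."""
        return self.end_time - self.start_time


def parse_line(line: str) -> SegmentAnnotation:
    """Parse one line of a HYPOTHETICAL tab-separated layout (order assumed)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated fields, got {len(fields)}")
    start, end, sad, sad_prov, lang, lid_prov = fields
    return SegmentAnnotation(float(start), float(end), sad, sad_prov, lang, lid_prov)


# Made-up values for illustration only; not taken from the corpus.
example = parse_line("0.00\t4.25\tS\tmanual\turd\tmanual")
print(example.language_id, example.duration)
```

Keeping the two provenance fields next to the labels they describe makes it
easy to filter segments by how each label was assigned.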

RATS Language Identification is distributed via hard drive.

2018 Subscription Members will receive copies of this corpus. 2018 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

(3) TRAD Chinese-French Parallel Text – Broadcast News was developed by ELDA
as part of the PEA-TRAD project. It contains French translations of a subset
of approximately 30,000 Chinese characters from GALE Phase 1 Chinese Broadcast
News Parallel Text - Part 3 (LDC2008T18). The purpose of the PEA-TRAD project
(Translation as a Support for Document Analysis) was to develop
speech-to-speech translation technology for multiple languages (e.g., Arabic,
Chinese, Pashto) from a variety of domains. 

This release consists of 977 segments (translation units) from 139 documents.
The Chinese source file contains 33,571 characters and the French reference
translation contains 22,424 words. The source data is Chinese broadcast news
collected and translated into English by LDC for the DARPA GALE (Global
Autonomous Language Exploitation) program. Information about the ELDA
translation team, translation guidelines and validation results is contained
in the documentation accompanying this release.

TRAD Chinese-French Parallel Text – Broadcast News is distributed via web
download.

2018 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2018
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

Linguistic Field(s): Computational Linguistics
