28.4892, FYI: November 2017 Newsletter - LDC

Tue Nov 21 20:25:36 UTC 2017

LINGUIST List: Vol-28-4892. Tue Nov 21 2017. ISSN: 1069 - 4875.

Subject: 28.4892, FYI: November 2017 Newsletter - LDC

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté,
                                   Michael Czerniakowski)
Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Kenneth Steimel <ken at linguistlist.org>
================================================================

Date: Tue, 21 Nov 2017 15:24:22
From: Membership Office [ldc at ldc.upenn.edu]
Subject: November 2017 Newsletter - LDC

In this newsletter: 

Join LDC for Membership Year 2018

Spring 2018 Data Scholarship Program

Commercial use and LDC data

New Publications:

(1) ASpIRE Development and Development Test Sets

(2) CIEMPIESS Light

(3) IARPA Babel Kurmanji Kurdish Language Pack IARPA-babel205b-v1.0a

(4) TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training &
Evaluation Data 2011-2014

Join LDC for Membership Year 2018

Membership Year 2018 (MY2018) is now open, and discounts are available for
organizations that keep their membership current and join early in the year.
Through March 1, 2018, current MY2017 members who renew will receive a 10%
discount on the membership fee, and new or returning organizations will
receive a 5% discount.

In addition to receiving new publications, current year LDC members also enjoy
the benefit of licensing older data at reduced costs from our Catalog of over
700 holdings; current year for-profit members may use most data for commercial
applications.

Plans for MY2018 publications are in progress. Among the expected releases
are: 

- Multilanguage conversational telephone speech: developed to support language
identification research in related languages (Central Asian, Central European
language groups)
- DIRHA (Distant-speech Interaction for Robust Home Applications):  Wall
Street Journal read speech with noise and reverberation, suitable for various
multi-microphone signal processing and distant speech recognition tasks
- TRAD corpora: Chinese-French and Arabic-French parallel text (newswire, web
data)
- IARPA Babel Language Packs (telephone speech and transcripts): languages
include Cebuano, Guarani, Kazakh, Lithuanian, Telugu, Tok Pisin
- BOLT: discussion forum, SMS, word-aligned, and tagged data in all languages
(Egyptian Arabic, English, Chinese)
- DEFT: Spanish Treebank (newswire, web data)
- RATS:  Language Identification data set (Dari, Farsi, Levantine Arabic,
Pashto, Urdu; degraded audio signals)
- TAC KBP: comprehensive English source and entity linked data (broadcast,
telephone speech, newswire, web data)
- German children’s handwriting: longitudinal study of weekly writing in
classroom setting with enhanced output for specific spelling patterns

And don’t forget, MY2017 and MY2016 are still open for joining. MY2016 can be
joined through December 31, 2017 and includes data such as BOLT Chinese
Discussion Forums, IARPA Babel Language Packs in multiple languages and
Multi-Language Conversational Telephone Speech – Slavic Group. MY2017 will
remain open through December 31, 2018; among the year’s releases are 2010 NIST
Speaker Recognition Evaluation Test Set, RATS Keyword Spotting, Noisy TIMIT
Speech and BOLT Egyptian Arabic SMS/Chat and Transliteration. For full
descriptions of these data sets, browse our Catalog.  
Visit Join LDC for details on membership, user accounts and payment.

Spring 2018 Data Scholarship Program:

Applications are now being accepted through January 15, 2018 for the Spring
2018 LDC Data Scholarship program which provides university students with
no-cost access to LDC data. Consult the LDC Data Scholarship page for more
information about program rules and submission requirements. 

Commercial use and LDC data:

For-profit organizations are reminded that an LDC membership is a
prerequisite for obtaining a commercial license to almost all LDC databases.
Non-member organizations, including non-member for-profit organizations,
cannot use LDC data to develop or test products for commercialization, nor can
they use LDC data in any commercial product or for any commercial purpose. LDC
data users should consult corpus-specific license agreements for limitations
on the use of certain corpora. Visit the Licensing page for further
information. 

New publications:

(1) ASpIRE Development and Development Test Sets was developed for the
Automatic Speech recognition In Reverberant Environments (ASpIRE) Challenge
sponsored by IARPA (the Intelligence Advanced Research Projects Activity). It
contains approximately 226 hours of English speech with transcripts and
scoring files.

The audio data is a subset of Mixer 6 Speech (LDC2013S03), audio recordings of
interviews, transcript readings and conversational telephone speech collected
by LDC in 2009 and 2010 from native English speakers local to the Philadelphia
area. The transcripts were developed by Appen for the ASpIRE challenge.

Data is divided into development and development test sets.

ASpIRE Development and Development Test Sets is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2017
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

(2) CIEMPIESS Light (Corpus de Investigación en Español de México del Posgrado
de Ingeniería Eléctrica y Servicio Social) was developed by the Speech
Processing Laboratory of the Faculty of Engineering at the National Autonomous
University of Mexico (UNAM) and consists of approximately 18 hours of Mexican
Spanish radio and television speech and associated transcripts. The goal of
this work was to create acoustic models for automatic speech recognition. For
more information and documentation see the CIEMPIESS-UNAM Project website.

CIEMPIESS Light is an updated version of CIEMPIESS, released by LDC as
LDC2015S07. This "light" version contains speech and transcripts presented
in a revised directory structure that allows for use with the Kaldi toolkit. 

The speech recordings were collected from Podcast UNAM, a program created by
Radio-IUS, and Mirador Universitario, a TV program broadcast by UNAM. They
consist of spontaneous conversations in Mexican Spanish between a moderator
and guests. 

The audio files are in 16 kHz, 16-bit PCM FLAC format, and transcripts are
presented as UTF-8 encoded plain text.

CIEMPIESS Light is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data at no cost.
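
For readers preparing the CIEMPIESS Light audio for a toolkit, the stated
format (16 kHz, 16-bit PCM in FLAC) can be checked from each file's header.
The following is a minimal sketch, not part of the corpus or its
documentation: the function name is hypothetical, and it assumes the
STREAMINFO metadata block comes first in the file, as the FLAC format
requires.

```python
def read_flac_streaminfo(data: bytes) -> dict:
    """Parse sample rate, channel count and bit depth from FLAC header bytes.

    Hypothetical helper for illustration; pass at least the first 42 bytes
    of a .flac file.
    """
    if data[:4] != b"fLaC":
        raise ValueError("missing 'fLaC' stream marker")
    # Metadata block header: 1 bit last-block flag, 7 bits type, 24 bits length.
    block_type = data[4] & 0x7F
    length = int.from_bytes(data[5:8], "big")
    if block_type != 0 or length < 34:
        raise ValueError("first metadata block is not a valid STREAMINFO")
    body = data[8:8 + 34]
    if len(body) < 34:
        raise ValueError("header is truncated")
    # Bytes 10..17 pack: 20 bits sample rate, 3 bits (channels - 1),
    # 5 bits (bits per sample - 1), 36 bits total sample count.
    packed = int.from_bytes(body[10:18], "big")
    return {
        "sample_rate": packed >> 44,
        "channels": ((packed >> 41) & 0x7) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "total_samples": packed & ((1 << 36) - 1),
    }
```

In use, one would read the first 42 bytes of each audio file in binary mode
and confirm that sample_rate is 16000 and bits_per_sample is 16.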

(3) IARPA Babel Kurmanji Kurdish Language Pack IARPA-babel205b-v1.0a was
developed by Appen for the IARPA (Intelligence Advanced Research Projects
Activity) Babel program. It contains approximately 203 hours of Kurmanji
Kurdish conversational and scripted telephone speech collected in 2013 and
2014 along with corresponding transcripts.

The Kurmanji Kurdish speech in this release represents that spoken in the
southeastern and eastern Anatolian regions of Turkey. The gender distribution
among speakers is approximately 37% female and 63% male; speakers' ages range
from 16 years to 70 years. Calls were made using different telephones (e.g.,
mobile, landline) from a variety of environments including the street, a home
or office, a public place, and inside a vehicle.

IARPA Babel Kurmanji Kurdish Language Pack IARPA-babel205b-v1.0a is
distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2017
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

(4) TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training &
Evaluation Data 2011-2014 was developed by LDC and contains training and
evaluation data produced in support of the TAC KBP Chinese Cross-lingual
Entity Linking tasks in 2011, 2012, 2013 and 2014. It includes queries and
gold standard entity type information, Knowledge Base links, and equivalence
class clusters for NIL entities along with the source documents for the
queries, specifically, English and Chinese newswire, discussion forum and web
data. The corresponding knowledge base is available as TAC KBP Reference
Knowledge Base (LDC2014T16).

The goal of TAC KBP’s entity linking track is to measure systems’ ability to
determine whether an entity, specified by a query, has a matching node in a
reference knowledge base and if so, to create a link between the two. If there
is no matching node, entity linking systems are required to cluster the
mention together with others referencing the same entity. More information
about the TAC KBP Entity Linking task and other TAC KBP evaluations can be
found on the NIST TAC website.
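
The link-or-cluster decision described above can be sketched in a few lines.
This is only an illustration under stated assumptions: the function name, the
query and knowledge base structures, and the identifier formats are
hypothetical, and exact matching on a normalized name stands in for the
cross-lingual disambiguation that a real system would need.

```python
def link_or_cluster(queries, kb_index):
    """Assign each query a KB node ID or a NIL cluster ID.

    queries:  list of (query_id, mention_name) pairs (hypothetical format)
    kb_index: dict mapping a normalized name to a KB node ID (hypothetical)
    """
    results = {}
    nil_clusters = {}  # normalized name -> NIL cluster ID
    for query_id, name in queries:
        key = name.strip().lower()
        if key in kb_index:
            # A matching node exists: link the query to it.
            results[query_id] = kb_index[key]
        else:
            # No matching node: group with other mentions of the same entity.
            if key not in nil_clusters:
                nil_clusters[key] = f"NIL{len(nil_clusters) + 1:04d}"
            results[query_id] = nil_clusters[key]
    return results


# Toy data, invented for illustration only.
toy_kb = {"philadelphia": "E0000001"}
toy_queries = [
    ("Q1", "Philadelphia"),
    ("Q2", "Unknown Org"),
    ("Q3", "unknown org"),
]
toy_links = link_or_cluster(toy_queries, toy_kb)
```

Here Q1 is linked to the knowledge base node, while Q2 and Q3 have no match
and are placed in the same NIL cluster.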

TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training &
Evaluation Data 2011-2014 is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

Linguistic Field(s): Computational Linguistics

----------------------------------------------------------
LINGUIST List: Vol-28-4892	
----------------------------------------------------------