30.2090, FYI: May 2019 Newsletter - LDC

Fri May 17 01:52:41 UTC 2019

LINGUIST List: Vol-30-2090. Thu May 16 2019. ISSN: 1069 - 4875.

Subject: 30.2090, FYI: May 2019 Newsletter - LDC

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Jeremy Coburn
Managing Editor: Becca Morris
Team: Helen Aristar-Dry, Everett Green, Sarah Robinson, Peace Han, Nils Hjortnaes, Yiwen Zhang, Julian Dietrich
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================

Date: Thu, 16 May 2019 21:52:13
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: May 2019 Newsletter - LDC

In this newsletter: 
New Publications:

Multi-Language Conversational Telephone Speech 2011 -- English Group

TAC KBP Chinese Regular Slot Filling - Comprehensive 
Training and Evaluation Data 2014

CIEMPIESS Experimentation

IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c

New publications:

(1) Multi-Language Conversational Telephone Speech 2011 -- English Group was
developed by LDC and comprises approximately 18 hours of telephone speech in
two general varieties of English: American and South Asian.

The data were collected primarily to support research and technology
evaluation in automatic language identification, and portions of these
telephone calls were used in the NIST 2011 Language Recognition Evaluation
(LRE). Participants were recruited by native speakers who contacted
acquaintances in their social network. Those native speakers made one call of
up to 15 minutes to each acquaintance. Calls are labeled by human auditors for
callee gender, dialect type, and noise.

LDC has also released the following as part of the Multi-Language
Conversational Telephone Speech 2011 series:

- Slavic Group (LDC2016S11)
- Turkish (LDC2017S09)
- South Asian (LDC2017S14)
- Central Asian (LDC2018S03)
- Central European (LDC2018S08)
- Spanish (LDC2018S12)
- Arabic (LDC2019S02)

Multi-Language Conversational Telephone Speech 2011 -- English Group is
distributed via web download. 

2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee. 

*

(2) TAC KBP Chinese Regular Slot Filling - Comprehensive Training and
Evaluation Data 2014 was developed by LDC and contains training and evaluation
data produced in support of the TAC KBP Chinese Regular Slot Filling
evaluation track conducted in 2014. This release includes queries, the 'manual
runs' (human-produced responses to the queries), the final rounds of
assessment results, and the complete set of Chinese source documents.

The regular Chinese Slot Filling evaluation track involved mining information
about entities from text. In completing the task, participating systems and
LDC annotators searched a corpus for information on certain attributes (slots)
of person and organization entities and attempted to return all valid answers
(slot fillers) in the source collection. 

TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation
Data 2014 is distributed via web download. 

2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.  

*

(3) CIEMPIESS Experimentation (Corpus de Investigación en Español de México
del Posgrado de Ingeniería Eléctrica y Servicio Social) was developed by the
Facultad de Ingeniería at the National Autonomous University of Mexico (UNAM)
and consists of approximately 22 hours of Mexican Spanish broadcast and read
speech with associated transcripts. The goal of this work was to create
acoustic models for automatic speech recognition. For more information and
documentation see the CIEMPIESS-UNAM Project website.

CIEMPIESS Experimentation is a set of three different data sets, specifically
Complementary, Fem, and Test. Complementary is a phonetically-balanced corpus
of isolated Spanish words spoken in Central Mexico. Fem contains broadcast
speech from 21 female speakers, collected to provide gender balance against
the recordings from male speakers in other CIEMPIESS collections. Test
consists of 10 hours of broadcast speech and transcripts and is intended for
use as a standard test data set alongside other CIEMPIESS corpora.

Most of the speech recordings in Fem and Test were collected from Radio-IUS, a
UNAM radio station. Other recordings were taken from IUS Canal Multimedia and
Centro Universitario de Estudios Jurídicos (CUEJ UNAM). Those two channels
feature videos with speech around legal issues and topics related to UNAM. The
Complementary recordings consist of read speech collected for that corpus.

LDC has released the following data sets in the CIEMPIESS series:

- CIEMPIESS (LDC2015S07)
- CHM150 (LDC2016S04)
- CIEMPIESS Light (LDC2017S23)
- CIEMPIESS Balance (LDC2018S11)

CIEMPIESS Experimentation is distributed via web download. 

2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data at no cost.

*

(4) IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c was developed by
Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel
program. This corpus contains approximately 198 hours of Guarani
conversational and scripted telephone speech collected in 2014 and 2015 along
with corresponding transcripts.

The Guarani speech in this release represents that spoken in Paraguay. The
gender distribution among speakers is approximately equal; speakers' ages
range from 16 years to 67 years. Calls were made using different telephones
(e.g., mobile, landline) from a variety of environments including the street,
a home or office, a public place, and inside a vehicle.

IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c is distributed via web
download. 

2019 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2019
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

*

Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

Linguistic Field(s): Computational Linguistics

------------------------------------------------------------------------------
