29.3676, FYI: September 2018 Newsletter - LDC

Mon Sep 24 17:46:18 UTC 2018

LINGUIST List: Vol-29-3676. Mon Sep 24 2018. ISSN: 1069 - 4875.

Subject: 29.3676, FYI: September 2018 Newsletter - LDC

Moderator: linguist at linguistlist.org (Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté)
Homepage: https://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================

Date: Mon, 24 Sep 2018 13:45:37
From: Membership Office [ldc at ldc.upenn.edu]
Subject: September 2018 Newsletter - LDC

In this newsletter: 

New Publications:

BOLT Information Retrieval Comprehensive Training and Evaluation
HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation
Multi-Language Conversational Telephone Speech 2011 -- Spanish
IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a

New Publications:

(1) BOLT Information Retrieval Comprehensive Training and Evaluation was
developed by LDC and consists of all data produced in support of the
Information Retrieval (IR) task within the DARPA Broad Operational Language
Translation (BOLT) Program, including annotations, source documents and
scoring software.

The BOLT IR task sought to support development of systems that could take as
input a natural language English query sentence, return relevant responses to
that query from a large corpus of informal documents in the three BOLT
languages (Arabic, Chinese, and English) and translate responses from
non-English documents into English. This release contains (1) natural-language
IR queries, system responses to queries, and manually generated assessment
judgments for system responses; (2) discussion forum source documents in
Arabic, Chinese and English; (3) scoring software for each evaluation phase;
and (4) experimental data developed in Phase 2. 

BOLT Information Retrieval Comprehensive Training and Evaluation is
distributed via web download.

2018 Subscription Members will automatically receive copies of this corpus.
2018 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

(2) HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation was developed
by LDC and comprises approximately 53 hours of user-generated videos
with annotation and metadata. To advance multimodal event detection and
related technologies, LDC developed, in collaboration with NIST (the National
Institute of Standards and Technology), a large, heterogeneous, annotated
multimodal corpus for HAVIC (the Heterogeneous Audio Visual Internet
Collection) that was used in the NIST-sponsored MED (Multimedia Event
Detection) task for several years. HAVIC MED Event E051-E060 is a subset of
that corpus: a collection of event videos for the HAVIC Project originally
released to support the 2016 Multimedia Event Detection task.

The data consist of videos of various events (event videos) and videos
completely unrelated to events (background videos) harvested by a large team
of human annotators. Each event video was manually annotated with a set of
judgments describing its event properties and other salient features.
Background videos were labeled with topic and genre categories.

HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation is distributed
via web download.

2018 Subscription Members will automatically receive copies of this corpus.
2018 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

(3) Multi-Language Conversational Telephone Speech 2011 -- Spanish was
developed by LDC and comprises approximately 23 hours of telephone
speech in Spanish.

The data were collected primarily to support research and technology
evaluation in automatic language identification, and portions of these
telephone calls were used in the NIST 2011 Language Recognition Evaluation
(LRE). Participants were recruited by native speakers who contacted
acquaintances in their social network. Those native speakers made one call of
up to 15 minutes to each acquaintance. Human auditors labeled the calls for
callee gender, dialect type, and noise.

LDC has also released the following as part of the Multi-Language
Conversational Telephone Speech 2011 series:

- Slavic Group (LDC2016S11)
- Turkish (LDC2017S09)
- South Asian (LDC2017S14)
- Central Asian (LDC2018S03)
- Central European (LDC2018S08)

Multi-Language Conversational Telephone Speech 2011 -- Spanish is distributed
via web download.

2018 Subscription Members will automatically receive copies of this corpus.
2018 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

(4) IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a was developed by
Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel
program. It contains approximately 203 hours of Kazakh conversational and
scripted telephone speech collected in 2013 and 2014 along with corresponding
transcripts.

The Kazakh speech in this release represents that spoken in the Northeastern
and Southern dialect regions of Kazakhstan. The gender distribution among
speakers is approximately equal; speakers' ages range from 16 years to 64
years. Calls were made using different telephones (e.g., mobile, landline)
from a variety of environments including the street, a home or office, a
public place, and inside a vehicle.

IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a is available via web
download.

2018 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2018
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810, Philadelphia, PA 19104

Linguistic Field(s): Computational Linguistics
