30.4404, FYI: November 2019 Newsletter - LDC

Wed Nov 20 04:57:38 UTC 2019

LINGUIST List: Vol-30-4404. Tue Nov 19 2019. ISSN: 1069 - 4875.

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Jeremy Coburn
Managing Editor: Becca Morris
Team: Helen Aristar-Dry, Everett Green, Sarah Robinson, Peace Han, Nils Hjortnaes, Yiwen Zhang, Julian Dietrich
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================

Date: Tue, 19 Nov 2019 23:56:59
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: November 2019 Newsletter - LDC

In this newsletter: 

Join LDC for Membership Year 2020 
Spring 2020 Data Scholarship Program

New Publications:

DEFT English Committed Belief Annotation
CALLFRIEND American English-Non-Southern Dialect Second Edition
TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017
IARPA Babel Amharic Language Pack IARPA-babel307b-v1.0b

--

Join LDC for Membership Year 2020

Membership Year 2020 (MY2020) is open, and discounts are available for those
who keep their membership current and join early in the year. Through March 2,
2020, current MY2019 members who renew their LDC membership will receive a 10%
discount on the membership fee. New or returning organizations will receive a
5% discount through March 2, 2020.

In addition to receiving new publications, current LDC members also enjoy the
benefit of licensing older data at reduced costs from our Catalog of over 800
holdings. Current-year for-profit members may use most data for commercial
applications.

Plans for MY2020 publications are in progress. Among the expected releases
are:
- Abstract Meaning Representation (AMR) Annotation Release 3.0: semantic
treebank of over 59,000 English natural language sentences from broadcast
conversations, newswire, weblogs and web discussion forums; updates the second
version (LDC2017T10) with new annotations
- TAC KBP: English sentiment slot filling, surprise slot filling, nugget
detection and coreference, and event argument data in all languages (English,
Chinese and Spanish)
- DEFT Chinese ERE: Chinese discussion forum data annotated for entities,
relations and events
- LibriVox Spanish: 73 hours of Spanish audiobook read speech and transcripts 
- IARPA Babel Language Packs (telephone speech and transcripts): languages
include Dholuo, Javanese and Mongolian
- HAVIC Med Training data: web video, metadata, and annotations for developing
multimedia systems
- RATS Speaker Identification: conversational telephone speech in Levantine
Arabic, Pashto, Urdu, Farsi and Dari on degraded audio signals with annotation
of speech segments for speaker identification
- BOLT: discussion forums, SMS/chat, conversational telephone speech,
word-aligned, tagged and co-reference data in all languages (Chinese, Egyptian
Arabic, and English)

It’s also not too late to join for MY2018 (through December 31, 2019) and
MY2019 (through December 31, 2020). Data sets from those years include
Concretely Annotated New York Times and English Gigaword, DIRHA English WSJ
Audio, BOLT English Treebank – Discussion Forum, First DIHARD Challenge
Development and Evaluation releases, Penn Discourse Treebank Version 3.0, and
2016 NIST Speaker Recognition Evaluation Test Set.

For full descriptions of all LDC data sets, browse our Catalog.  

Visit the Join LDC page for details on membership, user accounts and payment.

Spring 2020 Data Scholarship Program

Applications are being accepted through January 15, 2020 for the Spring 2020
LDC Data Scholarship program, which provides university students with no-cost
access to LDC data. Consult the LDC Data Scholarship page for more information
about program rules and submission requirements.

--

New Publications:

(1) DEFT English Committed Belief Annotation was developed by LDC and consists
of approximately 950,000 words of English discussion forum text annotated for
"committed belief," which marks the level of commitment displayed by the
author to the truth of the propositions expressed in the text.

DARPA's Deep Exploration and Filtering of Text (DEFT) program aimed to address
remaining capability gaps in state-of-the-art natural language processing
technologies related to inference, causal relationships, and anomaly
detection. LDC supported the DEFT program by collecting, creating, and
annotating a variety of data sources.

DEFT English Committed Belief Annotation is distributed via web download. 

2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee. 

*

(2) CALLFRIEND American English-Non-Southern Dialect Second Edition was
developed by LDC and consists of approximately 26 hours of unscripted
telephone conversations between native speakers of non-Southern dialects of
American English. This second edition updates the audio files to wav format,
simplifies the directory structure, and adds documentation and metadata. The
first edition is available as CALLFRIEND American English-Non-Southern Dialect
(LDC96S46).

All data was collected before July 1997. Participants could speak with a
person of their choice on any topic; most called family members and friends.
All calls originated in North America. The recorded conversations last up to
30 minutes.

CALLFRIEND American English-Non-Southern Dialect Second Edition is distributed
via web download. 

2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

*

(3) TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017 was developed
by LDC and contains Chinese, English, and Spanish data produced in support of
the TAC KBP Cold Start evaluation track conducted from 2012 to 2017. This
corpus includes source documents, queries, assessments, manual runs, and final
assessments.

In the Cold Start track, systems were evaluated on their ability to construct
a new knowledge base (KB) from information provided in a text collection in
combination with technologies developed in other TAC KBP tracks -- slot
filling, information extraction, question answering, and entity discovery and
linking. Cold Start systems were required to find all entities in the text,
and the resulting KB ideally had to include every person, organization, and
geo-political entity, as well as all the targeted relations between them. To
facilitate the evaluation of those KBs, LDC annotators created sets of
queries, human-generated responses to the queries, and assessments of both
human and system responses.

The source data in this release consists of English and Spanish newswire
and web text collected by LDC for the 2012, 2014, and 2015 evaluations, and
the 2016 pilot collection. The source collections for the 2016 and 2017
evaluations, which include Chinese data, are available in TAC KBP Evaluation
Source Corpora 2016-2017 (LDC2019T12). The archived 2013 Cold Start source
data collection is available from NIST upon request.

TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017 is distributed
via web download. 

2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

*

(4) IARPA Babel Amharic Language Pack IARPA-babel307b-v1.0b was developed by
Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel
program. It contains approximately 204 hours of Amharic conversational and
scripted telephone speech collected in 2014 along with corresponding
transcripts.

The Amharic speech in this release represents the Addis Ababa, Shewa, and
Gondar dialect regions of Ethiopia. The gender distribution among speakers is
approximately equal; speakers' ages range from 16 years to 60 years. Calls
were made using different telephones (e.g., mobile, landline) from a variety
of environments including the street, a home or office, a public place, and
inside a vehicle.

IARPA Babel Amharic Language Pack IARPA-babel307b-v1.0b is distributed via web
download. 

2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104

Linguistic Field(s): Computational Linguistics
