30.3914, FYI: October 2019 Newsletter - LDC

The LINGUIST List linguist at listserv.linguistlist.org
Thu Oct 17 05:28:15 UTC 2019


LINGUIST List: Vol-30-3914. Thu Oct 17 2019. ISSN: 1069 - 4875.

Subject: 30.3914, FYI: October 2019 Newsletter - LDC

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Jeremy Coburn
Managing Editor: Becca Morris
Team: Helen Aristar-Dry, Everett Green, Sarah Robinson, Peace Han, Nils Hjortnaes, Yiwen Zhang, Julian Dietrich
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================


Date: Thu, 17 Oct 2019 01:27:58
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: October 2019 Newsletter - LDC

 
In this newsletter: 
Membership Year 2020 Publication Preview
LDC Data and Commercial Technology Development

New Publications:
BOLT English Treebank - Discussion Forum
Polish Speech Database
2016 NIST Speaker Recognition Evaluation Test Set


Membership Year 2020 Publication Preview
The 2020 Membership Year is just around the corner and plans for next year’s
publications are in progress. Among the expected releases are:

- Abstract Meaning Representation (AMR) Annotation Release 3.0: semantic
  treebank of over 59,000 English natural language sentences from broadcast
  conversations, newswire, weblogs and web discussion forums; updates the
  second version (LDC2017T10) with new annotations
- TAC KBP: English sentiment slot filling, surprise slot filling, nugget
  detection and coreference, and event argument data in all languages
  (English, Chinese, and Spanish)
- DEFT Chinese ERE: Chinese discussion forum data annotated for entities,
  relations, and events
- LibriVox Spanish: 73 hours of Spanish audiobook read speech and transcripts
- IARPA Babel Language Packs (telephone speech and transcripts): languages
  include Dholuo, Javanese, and Mongolian
- HAVIC Med Training data: web video, metadata, and annotations for
  developing multimedia systems
- RATS Speaker Identification: conversational telephone speech in Levantine
  Arabic, Pashto, Urdu, Farsi and Dari on degraded audio signals with
  annotation of speech segments for speaker identification
- BOLT: discussion forums, SMS/chat, conversational telephone speech,
  word-aligned, tagged and co-reference data in all languages (Chinese,
  Egyptian Arabic, and English)

Check your inbox in the coming weeks for more information about membership
renewal.  

LDC Data and Commercial Technology Development
For-profit organizations are reminded that an LDC membership is a
prerequisite for obtaining a commercial license to almost all LDC databases.
Non-member organizations, including non-member for-profit organizations,
cannot use LDC data to develop or test products for commercialization, nor can
they use LDC data in any commercial product or for any commercial purpose. LDC
data users should consult corpus-specific license agreements for limitations
on the use of certain corpora. Visit the Licensing page for further
information.


New Publications:

(1) BOLT English Treebank - Discussion Forum was developed by LDC and consists
of 268,907 tokens of English web discussion forum data, collected for the
DARPA BOLT (Broad Operational Language Translation) program and annotated for
part-of-speech and syntactic structure.

Part-of-speech and treebank annotation conformed to Penn Treebank II style,
incorporating changes to those guidelines that were developed under the GALE
(Global Autonomous Language Exploitation) program. Supplementary guidelines
for English treebanks and web text are included with this release.

The source data is English discussion forum web text collected by LDC in 2011
and 2012. A subset of that data -- 702 files representing 268,907 tokens --
was selected for the treebank and annotated for word-level tokenization,
part-of-speech and syntactic structure. The unannotated English source data is
released as BOLT English Discussion Forums (LDC2017T11).

BOLT English Treebank - Discussion Forum is distributed via web download. 

2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee. 

*

(2) Polish Speech Database was developed by VoiceLab and consists of 263,424
utterances of Polish speech data from 200 speakers, totaling approximately 280
hours, and corresponding transcripts.

Data collection was performed in Poland. Speakers were asked to record
themselves reading text on a website for at least 60 minutes from their home
computer while using a headset. The text consisted of sentences covering most
speech sounds in Polish.

This release includes speaker metadata. There were 103 male speakers and 97
female speakers, ranging from 15 to 60 years of age; most speakers were in the
15 to 30 age range.

Polish Speech Database is distributed via web download. 

2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee. 

*

(3) 2016 NIST Speaker Recognition Evaluation Test Set was developed by LDC and
NIST (National Institute of Standards and Technology) and contains
approximately 340 hours of short segments of Tagalog, Cantonese, Cebuano, and
Mandarin telephone speech used as development and test data in the
NIST-sponsored 2016 Speaker Recognition Evaluation (SRE).

As in previous evaluations, SRE16 focused on telephone speech recorded over a
variety of handset types for the training and test conditions. In addition to
development and evaluation data, this corpus also contains trial lists, their
associated keys, tables containing metadata information, and evaluation
documentation.

The telephone speech data was drawn from the Call My Net 2015 Corpus collected
by LDC. Native speakers of Tagalog, Cantonese, Cebuano, or Mandarin (220
unique speakers) each made ten telephone calls to people within their existing
social networks. Speakers were encouraged to use different telephone
instruments in a variety of acoustic settings and were instructed to talk for
8 to 10 minutes per call on a topic of their choice. All conversations were
collected outside North America.

2016 NIST Speaker Recognition Evaluation Test Set is distributed via web
download. 

2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee. 

*

Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104
 



Linguistic Field(s): Computational Linguistics





 


