30.2767, FYI: July 2019 Newsletter - LDC

The LINGUIST List linguist at listserv.linguistlist.org
Mon Jul 15 21:40:12 UTC 2019


LINGUIST List: Vol-30-2767. Mon Jul 15 2019. ISSN: 1069 - 4875.

Subject: 30.2767, FYI: July 2019 Newsletter - LDC

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Jeremy Coburn
Managing Editor: Becca Morris
Team: Helen Aristar-Dry, Everett Green, Sarah Robinson, Peace Han, Nils Hjortnaes, Yiwen Zhang, Julian Dietrich
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================


Date: Mon, 15 Jul 2019 17:38:31
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: July 2019 Newsletter - LDC

 
In this newsletter:
Fall 2019 LDC Data Scholarship Program
LDC data and commercial technology development

New Publications:

The DKU-JNU-EMA Electromagnetic Articulography Database
Phrase Detectives Corpus Version 2
First DIHARD Challenge Evaluation - Nine Sources
First DIHARD Challenge Evaluation - SEEDLingS

Fall 2019 LDC Data Scholarship Program
Student applications for the Fall 2019 LDC Data Scholarship program are being
accepted now through September 15, 2019. This scholarship program provides
eligible students with access to LDC data at no cost. Students must complete
an application consisting of a data use proposal and letter of support from
their advisor.

For application requirements and program rules, please visit the LDC Data
Scholarship page. 

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a
prerequisite for obtaining a commercial license to almost all LDC databases.
Non-member organizations, including non-member for-profit organizations,
cannot use LDC data to develop or test products for commercialization, nor can
they use LDC data in any commercial product or for any commercial purpose. LDC
data users should consult corpus-specific license agreements for limitations
on the use of certain corpora. Visit the Licensing page for further
information.

New Publications:

(1) The DKU-JNU-EMA Electromagnetic Articulography Database was developed by
Duke Kunshan University and Jinan University and contains approximately 10
hours of articulography and speech data in Mandarin, Cantonese, Hakka, and
Teochew Chinese from two to seven native speakers for each dialect.

Articulatory measurements were made using the NDI electromagnetic
articulography wave research system to capture real-time vocal tract variable
trajectories. Six sensors were placed at various locations in each subject's
mouth, and one reference sensor was placed on the bridge of the nose. For
simultaneous recording of speech signals, subjects also wore a head-mounted
close-talk microphone. 

Speakers took part in four types of recording sessions: one in which they
read complete sentences or short texts, and three in which they read sets of
related words sharing a specific consonant, vowel, or tone.

DKU-JNU-EMA Electromagnetic Articulography Database is distributed via web
download. 

2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee. 

*

(2) Phrase Detectives Corpus Version 2 was developed by the School of Computer
Science and Electronic Engineering at the University of Essex and consists of
approximately 407,000 tokens across 537 documents anaphorically annotated by
the Phrase Detectives Game, an online interactive "game-with-a-purpose"
(GWAP) designed to collect data about English anaphoric coreference.

This release constitutes a new version of the Phrase Detectives Corpus
(LDC2017T08), adding significantly more annotated tokens to the data set and
supplying, for each markable, players’ judgments and a silver label annotation
based on the probabilistic aggregation method for anaphoric information.

The documents in the corpus are taken from Wikipedia articles and from
narrative text in Project Gutenberg. The annotation is a simplified form of
the coding scheme used in The ARRAU Corpus of Anaphoric Information
(LDC2013T22).

Phrase Detectives Corpus Version 2 is distributed via web download. 

2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data at no cost.

*

(3) First DIHARD Challenge Evaluation - Nine Sources was developed by LDC and
contains approximately 18 hours of English and Chinese speech data along with
corresponding annotations used in support of the First DIHARD Challenge. 

The First DIHARD Challenge was an attempt to reinvigorate work on diarization
through a shared task focusing on "hard" diarization; that is, speech
diarization for challenging corpora on which existing state-of-the-art
systems were expected to fare poorly. As such, it included
speech from a wide sampling of domains representing diversity in number of
speakers, speaker demographics, interaction style, recording quality, and
environmental conditions as follows (all sources are in English unless
otherwise indicated):

- Autism Diagnostic Observation Schedule (ADOS) interviews
- Conversations in Restaurants
- DCIEM/HCRC map task (LDC96S38)
- Audiobook recordings from LibriVox
- Meeting speech collected by LDC in 2001 for the ROAR project (see, e.g., ISL
Meeting Speech Part 1 (LDC2004S05))
- 2001 U.S. Supreme Court oral arguments
- Mixer 6 Speech (LDC2013S02)
- Chinese video collected by LDC as part of the Video Annotation for Speech
Technologies (VAST) project
- YouthPoint radio interviews

This release, when combined with First DIHARD Challenge Evaluation - SEEDLingS
(LDC2019S13), contains the evaluation set audio data and annotation as well as
the official scoring tool. The development data for the First DIHARD Challenge
is also available from LDC as Eight Sources (LDC2019S09) and SEEDLingS
(LDC2019S10).

First DIHARD Challenge Evaluation - Nine Sources is distributed via web
download. 

2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

*

(4) First DIHARD Challenge Evaluation - SEEDLingS was developed by Duke
University and LDC and contains approximately two hours of English child
language recordings along with corresponding annotations used in support of
the First DIHARD Challenge. 

The source data was drawn from the SEEDLingS (The Study of Environmental
Effects on Developing Linguistic Skills) corpus, designed to investigate how
infants' early linguistic and environmental input plays a role in their
learning. Recordings for SEEDLingS were made in the home environments of
44 infants aged 6 to 18 months in the Rochester, New York area. A subset
of that data was annotated by LDC for use in the First DIHARD Challenge.

This release, when combined with First DIHARD Challenge Evaluation - Nine
Sources (LDC2019S12), contains the evaluation set audio data and annotation as
well as the official scoring tool. The development data for the First DIHARD
Challenge is also available from LDC as Eight Sources (LDC2019S09) and
SEEDLingS (LDC2019S10).

First DIHARD Challenge Evaluation - SEEDLingS is distributed via web download.

2019 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2019
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

*

Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104
 



Linguistic Field(s): Computational Linguistics





 



 

