28.3812, FYI: LDC September 2017 Newsletter

The LINGUIST List linguist at listserv.linguistlist.org
Mon Sep 18 18:00:52 UTC 2017


LINGUIST List: Vol-28-3812. Mon Sep 18 2017. ISSN: 1069 - 4875.

Subject: 28.3812, FYI: LDC September 2017 Newsletter

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté,
                                   Michael Czerniakowski)
Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Kenneth Steimel <ken at linguistlist.org>
================================================================


Date: Mon, 18 Sep 2017 14:00:41
From: Caitlin Fontecchio [ldc at ldc.upenn.edu]
Subject: LDC September 2017 Newsletter

 
In this newsletter: 

New Publications:

2015-2016 CoNLL Shared Task

IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e

SRI-FRTIV

Vehicle City Voices Corpus – Part I

New Publications:

(1) 2015-2016 CoNLL Shared Task contains the Chinese and English training,
development and test data for the 2015 and 2016 CoNLL (Conference on
Computational Natural Language Learning) Shared Task Evaluation, which focused
on shallow discourse parsing. This release consists of the tokenized, tagged,
and parsed data in English and Chinese. The English training, development and
test data are from Wall Street Journal material in Penn Discourse Treebank
Version 2.0 (LDC2008T05); English blind test data are from Wikinews. Chinese
training, development and test data are news material from Chinese Discourse
Treebank 0.5 (LDC2014T21); Chinese blind test data are from Wikinews.

LDC has also released the following CoNLL Shared Task data sets:
- 2006 CoNLL Shared Task - Ten Languages (LDC2015T11)
- 2006 CoNLL Shared Task - Arabic & Czech (LDC2015T12)
- 2008 CoNLL Shared Task Data (LDC2009T12)
- 2009 CoNLL Shared Task Part 1 (LDC2012T03)
- 2009 CoNLL Shared Task Part 2 (LDC2012T04)

2015-2016 CoNLL Shared Task is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

(2) IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e was developed by
Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel
program. It contains approximately 211 hours of Zulu conversational and
scripted telephone speech collected in 2012 and 2013 along with corresponding
transcripts.

The Babel program focuses on underserved languages and seeks to develop speech
recognition technology that can be rapidly applied to any human language to
support keyword search performance over large amounts of recorded speech.

The Zulu speech in this release represents that spoken in the KZN
(KwaZulu-Natal)-urban dialect region of South Africa. The gender distribution
among speakers is approximately equal; speakers' ages range from 16 years to
70 years. Calls were made using different telephones (e.g., mobile, landline)
from a variety of environments including the street, a home or office, a
public place, and inside a vehicle.

IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e is distributed via web
download.

2017 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2017
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee. 

(3) SRI-FRTIV (Five-way Recorded Toastmaster Intrinsic Variation) was
developed by SRI International in 2007-2008 and comprises approximately
232 hours of English speech from thirty-four speakers who were members of
Toastmasters clubs. Participants were asked to speak at three different levels
of effort (low, normal and high) in four different styles (interview,
conversation, reading and oration) to study how intrinsic
variations -- associated with the speaker rather than the recording
environment -- affect text-independent speaker verification.

Participants were native speakers of North American English who were members
of local Toastmasters clubs and had experience in public speaking. This
release includes demographic information for 30 speakers (15 male, 15 female),
including gender, birth year, height, education level, years in Toastmasters,
and a self-evaluation of speaking skills.

SRI-FRTIV is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

(4) Vehicle City Voices Corpus – Part I was developed at the University of
Michigan-Flint as part of an ongoing oral history project and survey of English
language variation in Flint, Michigan. It contains approximately 16 hours of
speech with corresponding transcripts from interviews of Flint residents
conducted between 2012 and 2015. The corpus was designed to provide
high-quality recordings for acoustic analysis and to examine narrative
structure and discursive construction of individual and collective identity in
urban spaces.

This release consists of 21 interviews by undergraduate and graduate
students for civic engagement projects in linguistics courses and by a
graduate student research assistant. Participants (11 female, 10 male) were
born between 1935 and 1991 and represented a range of ages, genders, and
ethnicities. Of the interviewees, 11 were Black/African American, 8 were
White/Caucasian, and 2 were biracial/mixed ethnic heritage.

Metadata (where provided by participants) includes information on gender,
ethnicity, year of birth, level of education, field of employment, average
income, length of time living in Flint and its surrounding areas, as well as
interviewer age, gender, and ethnicity. The metadata file also provides
original interview durations, edited interview durations, interview year, and
transcript word counts.

Vehicle City Voices Corpus – Part I is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2017
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104
 



Linguistic Field(s): Computational Linguistics





 


