28.885, FYI: News from LDC

The LINGUIST List linguist at listserv.linguistlist.org
Wed Feb 15 20:19:26 UTC 2017


LINGUIST List: Vol-28-885. Wed Feb 15 2017. ISSN: 1069 - 4875.

Subject: 28.885, FYI: News from LDC

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté,
                                   Michael Czerniakowski)
Homepage: http://linguistlist.org

Editor for this issue: Yue Chen <yue at linguistlist.org>
================================================================


Date: Wed, 15 Feb 2017 15:19:18
From: Katie Kindle [ldc at ldc.upenn.edu]
Subject: News from LDC

 
LDC Director Mark Liberman receives the IEEE James L. Flanagan Speech and
Audio Processing Award

Only two weeks left to enjoy 2017 membership discounts

Spring 2017 LDC Data Scholarship recipients

New publications:

First-Year Law Students' Court Memoranda
IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b
GALE Phase 3 Arabic Broadcast News Speech Part 2 
GALE Phase 3 Arabic Broadcast News Transcripts Part 2

LDC Director Mark Liberman receives the IEEE James L. Flanagan Speech and
Audio Processing Award
 
LDC Director Mark Liberman is the 2017 recipient of the IEEE James L. Flanagan
Speech and Audio Processing Award. Established in 2002, this annual award
recognizes an individual for his or her outstanding contribution to the
advancement of speech and/or audio processing. Liberman’s pioneering
contributions and continued leadership in robust, replicable, and data-driven
speech and language science and engineering have fueled the development and
advancement of human language technologies including speech and speaker
recognition, machine translation, and semantic analysis. As LDC’s founder,
Mark has shepherded the Consortium from a small organization to the largest
developer of shared language resources, distributing more than 120,000 copies
of over 2,000 databases covering 91 different languages to more than 3,600
organizations in over 70 countries.
 
Liberman will receive the award at ICASSP 2017 in New Orleans (March 5-9). LDC
will be an exhibitor at Booth 43. Please stop by and say hello. We hope to see
you there.    

Only two weeks left to enjoy 2017 membership discounts

There is still time to save on 2017 membership fees. Through March 1, all
organizations receive a discount on the 2017 membership fee (up to 10%) when
they choose to join or renew.   

For more information on membership benefits, visit Join LDC. 

Spring 2017 LDC Data Scholarship recipients

Congratulations to the recipients of LDC's Spring 2017 data scholarship:

Umad Ul Hassan and Muhammad Awais Zulfiqar: National University of Sciences
and Technology (Pakistan); BS Computer Science. Hassan and Zulfiqar are
awarded copies of CSLU: Kids’ Speech Version 1.1 and The CMU Kids Corpus for
their research in speech recognition for children with learning difficulties.
 
For information about the program, visit the Data Scholarship page. 

New publications:

- First-Year Law Students' Court Memoranda consists of 197 English law student
writing samples of legal briefs annotated for certain characteristics along
with accompanying survey responses by student writers.

The briefs were created in writing classes at two law schools in
the US Midwest during the 2011-12 academic year. Students who agreed to
participate in this study uploaded their briefs to an online survey instrument
and answered questions regarding their age, gender, level of education, most
recent writing course and method of learning English. The study's purpose was
to apply natural language processing approaches to determine any differences
in the briefs' language attributable to the students' self-reported genders.

The samples were imported into the General Architecture for Text Engineering
(GATE) and annotated by two human coders who identified large text segments
specific to the legal genre in which the students wrote, such as text
headings, citations, block quotes and footnotes.

Writing samples are presented as MS Word documents; annotations and survey
responses are presented in XML format. The data has been anonymized to remove
names and other identifying information about the student participants.

First-Year Law Students' Court Memoranda is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2017
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee. 

- IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b was
developed by Appen for the IARPA (Intelligence Advanced Research Projects
Activity) Babel program. It contains approximately 203 hours of Haitian Creole
conversational and scripted telephone speech collected in 2012 and 2013 along
with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech
recognition technology that can be rapidly applied to any human language to
support keyword search performance over large amounts of recorded speech.

The Haitian Creole speech in this release represents that spoken in the
Northern, Western and Southern dialect regions in Haiti. The gender
distribution among speakers is approximately equal; speakers' ages range from
16 years to 75 years. Calls were made using different telephones (e.g.,
mobile, landline) from a variety of environments including the street, a home
or office, a public place, and inside a vehicle.

Transcripts are encoded in UTF-8. 

IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b is distributed
via web download.

2017 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2017
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee. 

- GALE Phase 3 Arabic Broadcast News Speech Part 2 was developed by LDC and
comprises approximately 128 hours of Arabic broadcast news speech collected
in 2007 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco) during
Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding transcripts are released as GALE Phase 3 Arabic Broadcast News
Transcripts Part 2 (LDC2017T04).

The recordings in this corpus feature news broadcasts focusing principally on
current events from various broadcast programmers including Abu Dhabi TV, Al
Alam News Channel, Al Arabiya, Al Iraqiyah, Aljazeera, Al Ordiniyah, Dubai TV,
Kuwait TV, Lebanese Broadcast Corporation, Nile TV, Saudi TV and Syria TV. 
 
This release contains 175 audio files presented in FLAC-compressed Waveform
Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was
audited by a native Arabic speaker. 
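
As a rough sizing aid, the audio parameters stated above allow an estimate of
the uncompressed PCM footprint of the release. The sketch below is not part of
the LDC documentation: the function name pcm_bytes is hypothetical, and the
128-hour figure is the approximate total quoted in this announcement, so the
result is only an estimate. The FLAC-compressed files will be smaller.

```python
def pcm_bytes(hours, sample_rate_hz, bits_per_sample, channels):
    """Return the size in bytes of uncompressed PCM audio of the given duration."""
    seconds = hours * 3600
    bytes_per_sample = bits_per_sample // 8
    return seconds * sample_rate_hz * bytes_per_sample * channels


# Approximate total duration and format taken from the announcement:
# 128 hours, 16000 Hz, 16-bit, single channel.
estimate = pcm_bytes(128, 16000, 16, 1)
print(f"{estimate / 1e9:.1f} GB uncompressed")
```

Under these assumptions the decoded audio comes to roughly 14.7 GB.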

GALE Phase 3 Arabic Broadcast News Speech Part 2 is distributed via web
download. 

2017 Subscription Members will automatically receive copies of this corpus.
2017 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee. 

- GALE Phase 3 Arabic Broadcast News Transcripts Part 2 was developed by LDC
and contains transcriptions of approximately 128 hours of Arabic broadcast
news speech collected in 2007 by LDC, MediaNet (Tunis, Tunisia) and MTC
(Rabat, Morocco) during Phase 3 of the DARPA GALE (Global Autonomous Language
Exploitation) program.
 
Corresponding audio data is released as GALE Phase 3 Arabic Broadcast News
Speech Part 2 (LDC2017S02). 

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8
encoding, and the transcribed data totals 721,846 tokens. The transcripts were
created with the LDC tool, XTrans, which supports manual transcription and
annotation of audio recordings. 
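
Because the transcripts are tab-delimited UTF-8 text, they can be read with
standard tools. The following is a minimal sketch of a token count, not LDC's
method: the column layout is not described in this announcement, so the
function name count_tokens, the header row, the column index, and the sample
rows are all assumptions made for illustration. Simple whitespace splitting
may not reproduce the 721,846-token figure reported above.

```python
import csv


def count_tokens(lines, text_col, has_header=True):
    """Count whitespace-separated tokens in one column of tab-delimited text.

    lines      -- any iterable of text lines (an open file or a list of strings)
    text_col   -- zero-based index of the transcript column (assumed, see above)
    has_header -- whether the first row is a header to skip (assumed)
    """
    # QUOTE_NONE keeps quotation marks inside transcripts as literal text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    total = 0
    for index, row in enumerate(reader):
        if has_header and index == 0:
            continue
        if len(row) <= text_col:
            continue  # skip short or blank rows
        total += len(row[text_col].split())
    return total


# Hypothetical three-column sample; the real TDF layout may differ.
sample = [
    "file\tstart\ttranscript\n",
    "demo\t0.00\tمرحبا بكم في النشرة\n",
    "demo\t2.50\tأهلا وسهلا\n",
]
print(count_tokens(sample, text_col=2))
```

For a file on disk, the same function can be given a handle opened with
open(path, encoding="utf-8", newline=""), where path and the column index are
supplied by the reader after consulting the corpus documentation.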
 
The files in this corpus were transcribed by LDC staff and/or by transcription
vendors under contract to LDC. Transcribers followed LDC's quick transcription
guidelines (QTR) and quick rich transcription specification (QRTR), both of
which are included in the documentation with this release. 
 
GALE Phase 3 Arabic Broadcast News Transcripts Part 2 is distributed via web
download. 

2017 Subscription Members will automatically receive copies of this corpus.
2017 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104
 



Linguistic Field(s): Computational Linguistics





 



----------------------------------------------------------
LINGUIST List: Vol-28-885	
----------------------------------------------------------