27.3089, FYI: July 2016 Newsletter – LDC

Wed Jul 27 15:32:41 UTC 2016

LINGUIST List: Vol-27-3089. Wed Jul 27 2016. ISSN: 1069 - 4875.

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry,
                                   Robert Coté, Michael Czerniakowski)
Homepage: http://linguistlist.org

*****************    LINGUIST List Support    *****************
                       Fund Drive 2016
                   25 years of LINGUIST List!
Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Ashley Parker <ashley at linguistlist.org>
================================================================

Date: Wed, 27 Jul 2016 11:32:33
From: LDC LDC [ldc at ldc.upenn.edu]
Subject: July 2016 Newsletter – LDC

In this Newsletter:
Fall 2016 Data Scholarship Program
2015 User Survey Results

New Publications:
English Speed Networking Conversational Transcripts
Digital Archive of Southern Speech - NLP Version (DASS-NLP)
GALE Phase 3 and 4 Chinese Broadcast News Parallel Text
IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c

Fall 2016 Data Scholarship Program

Applications for the Fall 2016 LDC Data Scholarship program are being
accepted through Thursday, September 15, 2016. The program provides
university students with access to LDC data at no cost.

This program is open to students pursuing undergraduate or graduate
studies at an accredited college or university. LDC Data Scholarships are not
restricted to any particular field of study; however, students must
demonstrate a well-developed research agenda and a bona fide inability to pay.
The selection process is highly competitive.

The application consists of two parts:

(1) Data Use Proposal. Applicants must submit a two-page proposal describing
their intended use of the data. The proposal should state which data the
student plans to use, how the data will benefit their research project, the
proposed methodology or algorithm, and how success will be measured.

Applicants should consult the Catalog for a complete list of data distributed
by LDC. Because of certain restrictions, a handful of LDC corpora are
available only to members of the Consortium. Applicants are advised to select
no more than two databases.

(2) Letter of Support. Applicants must submit one letter of support from their
thesis adviser or department chair. The letter must be signed and printed on
letterhead, describe the student and the research, evaluate the probability of
success and confirm that the department or university lacks the funding to pay
the full non-member fee for the data. 

For further information on application materials and program rules, please
visit the LDC Data Scholarship page.

2015 User Survey Results

LDC conducted its fourth user survey in December 2015. The survey built on
the previous surveys, conducted in 2006, 2007 and 2012, to assess user
sentiment. It also asked respondents to evaluate key LDC-related topics,
including:

- Opinions on the new website and usability of the catalog
- Use of and satisfaction with the enhanced user services and e-commerce system
- LDC’s Data Management Plan capabilities
- Suggestions for future publications and preferred data delivery methods
- Use of web services for data access and processing

Overall, survey respondents were satisfied with LDC’s data, membership
options, website, Catalog and enhanced user services. Participants cited the
top five most useful corpora received between 2012 and 2015 as OntoNotes
Release 5.0, TIMIT, TAC KBP Reference Knowledge Base, Penn Discourse Treebank
V 2.0, and Multi-Channel WSJ Audio. Three-fourths of respondents prefer
digital delivery of data, and the top three languages for current research
demands were identified as English, Chinese and Spanish.

We thank everyone who participated in this survey. Responses will benefit the
future of the Consortium and will help LDC to better meet the needs of our
members and data licensees.

New Publications

(1) English Speed Networking Conversational Transcripts was developed at the
University of the West of England and contains 388 transcripts of English
face-to-face and instant messaging conversations about business ideas,
collected in 2014 and 2015 from undergraduate student participants playing
different power roles.

This corpus was created to examine communication accommodation, specifically
the ways in which an individual's linguistic style is affected by social power
and personality. The data was collected in two studies. In the first study, 40
participants had a series of paired five-minute face-to-face conversations,
each playing a high, low or neutral power role. The same procedure was
followed in the second study except that participants discussed business ideas
via instant messaging.

The face-to-face conversations were audio-recorded and transcribed verbatim. 

All transcripts are presented as UTF-8 plain text files.

English Speed Networking Conversational Transcripts is distributed via web
download.

2016 Subscription Members will automatically receive two copies of this
corpus. 2016 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.

(2) Digital Archive of Southern Speech - NLP Version (DASS-NLP) was developed
by LDC as an alternate version of Digital Archive of Southern Speech (DASS)
(LDC2012S03) suitable for natural language processing and human language
technology applications. Specifically, the original audio files have been
converted to 16 kHz, 16-bit FLAC-compressed WAV, and file names have been
normalized to facilitate automatic processing.

DASS was developed by the University of Georgia. It is a subset of the
Linguistic Atlas of the Gulf States (LAGS), which is in turn part of the
Linguistic Atlas Project (LAP). DASS-NLP contains approximately 366 hours of
English speech data from 30 female speakers and 34 male speakers, along with
associated metadata about the speakers and the recordings, and maps in .jpeg
format relating to the recording locations.

LAP consists of a set of survey research projects about the words and
pronunciation of everyday American English, the largest project of its kind in
the United States. Interviews with thousands of native speakers across the
country have been carried out since 1929. LAGS surveyed the everyday speech of
Georgia, Tennessee, Florida, Alabama, Mississippi, Arkansas, Louisiana, and
Texas in a series of 914 audio-taped interviews conducted from 1968 to 1983.

The speakers' average age is 61 years; there are 30 women and 34 men from the
Gulf States region represented in this release. The interviews cover common
topics such as family, the weather, household articles and activities,
agriculture and social conditions.   

Digital Archive of Southern Speech - NLP Version is distributed via web
download. 

2016 Not-for-Profit Subscription Members will automatically receive two copies
of this corpus. 2016 For-Profit Subscription Members will receive two copies
provided they have submitted a completed copy of the For-Profit Member User
License Agreement for Digital Archive of Southern Speech – NLP Version
(LDC2016S05). 2016 Standard Members may request a copy as part of their 16
free membership corpora. This data is being made available at no cost to
non-member organizations under a research license.

(3) GALE Phase 3 and 4 Chinese Broadcast News Parallel Text was developed by
LDC. Along with other corpora, the parallel text in this release comprised
training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language
Exploitation) Program. This corpus contains Chinese source text and
corresponding English translations selected from broadcast news data collected
by LDC between 2006 and 2008 and transcribed and translated by LDC or under
its direction.

GALE Phase 3 and 4 Chinese Broadcast News Parallel Text includes 76
source-translation document pairs, comprising 614,608 tokens of Chinese source
text and its English translation. Data is drawn from 16 distinct Chinese
programs broadcast between 2006 and 2008 by China Central TV, a national and
international broadcaster in Mainland China, and Phoenix TV, a Hong Kong-based
satellite television station. The programs in this release are news programs
covering current events.

The files in this release were transcribed by LDC staff and/or transcription
vendors under contract to LDC in accordance with the Quick Rich Transcription
guidelines developed by LDC. 

Source data and translations are distributed in TDF format. All data are
encoded in UTF-8.

GALE Phase 3 and 4 Chinese Broadcast News Parallel Text is distributed via web
download.

2016 Subscription Members will automatically receive two copies of this
corpus. 2016 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.

(4) IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c was developed by
Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel
program. It contains approximately 215 hours of Cantonese conversational and
scripted telephone speech collected in 2011 along with corresponding
transcripts.

The Babel program focuses on underserved languages and seeks to develop speech
recognition technology that can be rapidly applied to any human language to
support keyword search performance over large amounts of recorded speech.

The Cantonese speech in this release represents the language as spoken in the
Chinese provinces of Guangdong and Guangxi and covers five dialect groups
within those provinces. The gender distribution among speakers is approximately even;
speakers' ages range from 16 years to 67 years. Calls were made using
different telephones (e.g., mobile, landline) from a variety of environments
including the street, a home or office, a public place, and inside a vehicle.

All audio data is presented as 8 kHz, 8-bit A-law encoded audio in SPHERE
format. Transcripts are available in two versions: simplified Chinese
characters and a romanization scheme based on the Yale system, both encoded in
UTF-8.
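
For readers unfamiliar with A-law encoding, the following is a minimal Python
sketch of expanding 8-bit A-law samples to 16-bit linear PCM, following the
standard G.711 expansion. It assumes the raw sample bytes have already been
extracted from the SPHERE file (header parsing is not shown). The function
names alaw_to_pcm16 and decode_alaw are illustrative only and are not part of
any LDC or Babel tooling.

```python
def alaw_to_pcm16(byte: int) -> int:
    """Expand one 8-bit A-law code word to a signed 16-bit linear sample."""
    a = byte ^ 0x55                 # undo the even-bit inversion used by G.711
    mantissa = (a & 0x0F) << 4      # 4-bit mantissa, shifted into position
    segment = (a & 0x70) >> 4       # 3-bit segment (exponent)
    if segment == 0:
        value = mantissa + 8        # linear segment near zero
    else:
        value = mantissa + 0x108    # add the implicit leading bit and bias
        if segment > 1:
            value <<= segment - 1   # scale by the segment
    # In A-law, a set sign bit denotes a positive sample.
    return value if a & 0x80 else -value


def decode_alaw(payload: bytes) -> list[int]:
    """Decode a sequence of raw A-law bytes (hypothetical extracted payload)."""
    return [alaw_to_pcm16(b) for b in payload]


if __name__ == "__main__":
    # Four example code words: smallest and largest magnitudes of each sign.
    print(decode_alaw(bytes([0xD5, 0x55, 0xAA, 0x2A])))
```

In practice an established audio toolkit would normally handle both the
SPHERE header and the A-law expansion; the sketch is meant only to show what
the 8-bit encoding represents.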

IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c is distributed via
web download.

2016 Subscription Members will receive two copies of this corpus provided they
have submitted a completed copy of the IARPA User Agreement for Not-for-Profit
Members or the IARPA User Agreement for For-Profit Members. 2016 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

Linguistic Field(s): Computational Linguistics

------------------------------------------------------------------------------
