[Corpora-List] New from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Fri Nov 22 17:40:24 UTC 2013
- Spring 2014 LDC Data Scholarship Program -
New publications:
- Chinese Treebank 8.0 -
- CSC Deceptive Speech -
------------------------------------------------------------------------
*Spring 2014 LDC Data Scholarship Program*
Applications are now being accepted through Wednesday, January 15, 2014,
11:59PM EST for the Spring 2014 LDC Data Scholarship program! The LDC
Data Scholarship program provides university students with access to LDC
data at no cost. During previous program cycles, LDC has awarded no-cost
copies of LDC data to over 35 individual students and student research
groups.
This program is open to students pursuing both undergraduate and
graduate studies in an accredited college or university. LDC Data
Scholarships are not restricted to any particular field of study;
however, students must demonstrate a well-developed research agenda and
a bona fide inability to pay. The selection process is highly competitive.
The application consists of two parts:
(1) Data Use Proposal. Applicants must submit a proposal describing
their intended use of the data. The proposal should state which data the
student plans to use and how the data will benefit their research
project as well as information on the proposed methodology or algorithm.
Applicants should consult the LDC Catalog
<http://catalog.ldc.upenn.edu/> for a complete list of data distributed
by LDC. Due to certain restrictions, a handful of LDC corpora are
available only to members of the Consortium. Applicants are advised to
select no more than two datasets; students may apply for
additional datasets during the following cycle once they have completed
processing of the initial datasets and published or presented work in some
juried venue.
(2) Letter of Support. Applicants must submit one letter of support from
their thesis adviser or department chair. The letter must verify the
student's need for data and confirm that the department or university
lacks the funding to pay the full Non-member Fee for the data or to join
the Consortium.
For further information on application materials and program rules,
please visit the LDC Data Scholarship
<https://www.ldc.upenn.edu/language-resources/data/data-scholarships> page.
Students can email their applications to the LDC Data Scholarship
program <mailto:datascholarships at ldc.upenn.edu>. Decisions will be sent
by email from the same address.
The deadline for the Spring 2014 program cycle is January 15, 2014,
11:59PM EST.
*New publications*
(1) Chinese Treebank 8.0
<http://catalog.ldc.upenn.edu/LDC2013T21> consists of approximately 1.5
million words of annotated and parsed text from Chinese newswire,
government documents, magazine articles, various broadcast news and
broadcast conversation programs, web newsgroups and weblogs.
The Chinese Treebank project began at the University of Pennsylvania in
1998, continued at the University of Colorado and then moved to Brandeis
University <http://www.cs.brandeis.edu/%7Ellc/page2/page2.html>. The
project's goal is to provide a large, part-of-speech tagged and fully
bracketed Chinese language corpus. The first delivery, Chinese Treebank
1.0, contained 100,000 syntactically annotated words from Xinhua News
Agency newswire. It was later corrected and released in 2001 as Chinese
Treebank 2.0 (LDC2001T11) <http://catalog.ldc.upenn.edu/LDC2001T11> and
consisted of approximately 100,000 words. The LDC released Chinese
Treebank 4.0 (LDC2004T05) <http://catalog.ldc.upenn.edu/LDC2004T05>, an
updated version containing roughly 400,000 words, in 2004. A year later,
LDC published the 500,000 word Chinese Treebank 5.0 (LDC2005T01)
<http://catalog.ldc.upenn.edu/LDC2005T01>. Chinese Treebank 6.0
(LDC2007T36) <http://catalog.ldc.upenn.edu/LDC2007T36>, released in
2007, consisted of 780,000 words. Chinese Treebank 7.0 (LDC2010T07)
<http://catalog.ldc.upenn.edu/LDC2010T07>, released in 2010, added new
annotated newswire data, broadcast material and web text, bringing the
total to approximately one million words. Chinese Treebank 8.0 adds new
annotated data from newswire, magazine articles and government documents.
There are 3,007 text files in this release, containing 71,369 sentences,
1,620,561 words and 2,589,848 characters (hanzi or foreign). The data is
provided in UTF-8 encoding, and the annotation has Penn Treebank-style
labeled brackets. Details of the annotation standard can be found in the
segmentation, POS-tagging and bracketing guidelines included in the
release. The data is provided in four formats: raw text, word-segmented,
POS-tagged and syntactically bracketed. All files
were automatically verified and manually checked.
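For readers unfamiliar with Penn Treebank-style labeled brackets, the
following is a minimal Python sketch of how one bracketed tree might be
read and flattened into (word, POS tag) pairs. It is not part of the LDC
release. The sample sentence, its tags and the function names
parse_brackets and leaves are illustrative assumptions, and the actual
files may carry additional markup described in the included guidelines.

```python
import re

def parse_brackets(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Hypothetical helper for illustration; not taken from the LDC release.
    """
    # Tokens are "(", ")" or any run of non-space, non-parenthesis characters.
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = ""
        # Some treebank files wrap each tree in an unlabeled outer bracket.
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])  # a word under a POS tag
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return parse_node()

def leaves(tree):
    """Return (word, POS tag) pairs in sentence order."""
    label, children = tree
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(leaves(child))
        else:
            pairs.append((child, label))
    return pairs

# Invented example sentence; the tags are assumed for illustration only.
sample = "(IP (NP (NR 中国)) (VP (VV 发展) (NP (NN 经济))))"
print(leaves(parse_brackets(sample)))
```

Flattening a bracketed tree in this way yields roughly the information
carried by the POS-tagged format, and dropping the tags yields the
word-segmented format.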
(2) CSC Deceptive Speech <http://catalog.ldc.upenn.edu/LDC2013S09> was
developed by Columbia University, SRI International and University of
Colorado Boulder. It consists of 32 hours of audio interviews from 32
native speakers of Standard American English (16 male, 16 female)
recruited from the Columbia University student population and the
community. The purpose of the study was to distinguish deceptive speech
from non-deceptive speech using machine learning techniques on extracted
features from the corpus.
The participants were told that they were participating in a
communication experiment which sought to identify people who fit the
profile of the top entrepreneurs in America. To this end, the
participants performed tasks and answered questions in six areas. They
were later told that they had received low scores in some of those areas
and did not fit the profile. The subjects then participated in an
interview where they were told to convince the interviewer that they had
actually achieved high scores in all areas and that they did indeed fit
the profile. The task of the interviewer was to determine how he thought
the subjects had actually performed, and he was allowed to ask them any
questions other than those that were part of the performed tasks. For
each question from the interviewer, subjects were asked to indicate
whether the reply was true or contained any false information by
pressing one of two pedals hidden from the interviewer under a table.
Interviews were conducted in a double-walled sound booth and recorded to
digital audio tape on two channels using Crown CM311A Differoid headworn
close-talking microphones, then downsampled to 16 kHz before processing.
The interviews were orthographically transcribed by hand using the NIST
EARS transcription guidelines. Labels for local lies were obtained
automatically from the pedal-press data and hand-corrected for
alignment, and labels for global lies were annotated during
transcription based on the known scores of the subjects versus their
reported scores. The orthographic transcription was force-aligned using
the SRI telephone speech recognizer adapted for full-bandwidth
recordings. There are several segmentations associated with the corpus:
the implicit segmentation of the pedal presses, semi-automatically
derived sentence-like units (EARS SLASH-UNITS or SUs) which
were hand labeled, intonational phrase units and the units corresponding
to each topic of the interview.
------------------------------------------------------------------------
--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium     Phone: 1 (215) 573-1275
University of Pennsylvania     Fax: 1 (215) 573-2175
3600 Market St., Suite 810     ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA     http://www.ldc.upenn.edu