[Corpora-List] New from LDC

Fri Nov 22 17:40:24 UTC 2013

*- Spring 2014 LDC Data Scholarship Program -*

/New publications:/

*- Chinese Treebank 8.0 -*

*- CSC Deceptive Speech -*
------------------------------------------------------------------------

*Spring 2014 LDC Data Scholarship Program*

Applications are now being accepted through Wednesday, January 15, 2014, 
11:59PM EST for the Spring 2014 LDC Data Scholarship program! The LDC 
Data Scholarship program provides university students with access to LDC 
data at no cost. During previous program cycles, LDC has awarded no-cost 
copies of LDC data to over 35 individual students and student research 
groups.

This program is open to students pursuing both undergraduate and 
graduate studies in an accredited college or university. LDC Data 
Scholarships are not restricted to any particular field of study; 
however, students must demonstrate a well-developed research agenda and 
a bona fide inability to pay. The selection process is highly competitive.

The application consists of two parts:

(1) Data Use Proposal. Applicants must submit a proposal describing 
their intended use of the data. The proposal should state which data the 
student plans to use and how the data will benefit their research 
project as well as information on the proposed methodology or algorithm.

Applicants should consult the LDC Catalog 
<http://catalog.ldc.upenn.edu/> for a complete list of data distributed 
by LDC. Due to certain restrictions, a handful of LDC corpora are 
available only to members of the Consortium. Applicants are advised to 
select a maximum of one to two datasets; students may apply for 
additional datasets during the following cycle once they have completed 
processing of the initial datasets and have published or presented work 
in some juried venue.

(2) Letter of Support. Applicants must submit one letter of support from 
their thesis adviser or department chair. The letter must verify the 
student's need for data and confirm that the department or university 
lacks the funding to pay the full Non-member Fee for the data or to join 
the Consortium.

For further information on application materials and program rules, 
please visit the LDC Data Scholarship 
<https://www.ldc.upenn.edu/language-resources/data/data-scholarships> page.

Students can email their applications to the LDC Data Scholarship 
program <mailto:datascholarships at ldc.upenn.edu>. Decisions will be sent 
by email from the same address.

The deadline for the Spring 2014 program cycle is January 15, 2014, 
11:59PM EST.

*New publications*

(1) Chinese Treebank 8.0 
<http://catalog.ldc.upenn.edu/LDC2013T21> consists of approximately 1.5 
million words of annotated and parsed text from Chinese newswire, 
government documents, magazine articles, various broadcast news and 
broadcast conversation programs, web newsgroups and weblogs.

The Chinese Treebank project began at the University of Pennsylvania in 
1998, continued at the University of Colorado and then moved to Brandeis 
University <http://www.cs.brandeis.edu/%7Ellc/page2/page2.html>. The 
project's goal is to provide a large, part-of-speech tagged and fully 
bracketed Chinese language corpus. The first delivery, Chinese Treebank 
1.0, contained 100,000 syntactically annotated words from Xinhua News 
Agency newswire. It was later corrected and released in 2001 as Chinese 
Treebank 2.0 (LDC2001T11) <http://catalog.ldc.upenn.edu/LDC2001T11> and 
consisted of approximately 100,000 words. The LDC released Chinese 
Treebank 4.0 (LDC2004T05) <http://catalog.ldc.upenn.edu/LDC2004T05>, an 
updated version containing roughly 400,000 words, in 2004. A year later, 
LDC published the 500,000 word Chinese Treebank 5.0 (LDC2005T01) 
<http://catalog.ldc.upenn.edu/LDC2005T01>. Chinese Treebank 6.0 
(LDC2007T36) <http://catalog.ldc.upenn.edu/LDC2007T36>, released in 
2007, consisted of 780,000 words. Chinese Treebank 7.0 (LDC2010T07) 
<http://catalog.ldc.upenn.edu/LDC2010T07>, released in 2010, added new 
annotated newswire data, broadcast material and web text, bringing the 
total to approximately one million words. Chinese Treebank 8.0 adds new 
annotated data from newswire, magazine articles and government documents.

There are 3,007 text files in this release, containing 71,369 sentences, 
1,620,561 words and 2,589,848 characters (hanzi or foreign). The data is 
provided in UTF-8 encoding, and the annotation has Penn Treebank-style 
labeled brackets. Details of the annotation standard can be found in the 
segmentation, POS-tagging and bracketing guidelines included in the 
release. The data is provided in four different formats: raw text, word 
segmented, POS-tagged, and syntactically bracketed. All files were 
automatically verified and manually checked.
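
As a rough illustration of how the bracketed format relates to the 
word-segmented and POS-tagged formats, the following Python sketch reads 
one Penn Treebank-style labeled-bracket string and derives the other two 
views from it. The sample sentence is invented for illustration and is 
not taken from the release; the function names and the "word_TAG" 
separator are assumptions, and the actual files may carry additional 
markup that this sketch does not handle.

```python
def parse_brackets(text):
    """Parse one labeled-bracket string into nested lists: [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        # Some treebank files wrap each tree in an unlabeled outer bracket.
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ')'
        return [label] + children

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the first tree")
    return tree


def tagged_words(tree):
    """Return (word, POS) pairs for the leaves, in sentence order."""
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, list):
            pairs.extend(tagged_words(child))
    return pairs


# Invented example sentence (not from the corpus).
sample = "(IP (NP (NR 中国)) (VP (VV 发展) (NP (NN 经济))))"
pairs = tagged_words(parse_brackets(sample))

segmented = " ".join(word for word, _ in pairs)          # word-segmented view
pos_tagged = " ".join(f"{word}_{tag}" for word, tag in pairs)  # POS-tagged view
raw = "".join(word for word, _ in pairs)                 # raw-text view

print(segmented)
print(pos_tagged)
print(raw)
```

Keeping the tree as plain nested lists avoids any external dependency; 
a treebank reader from an NLP toolkit would be a reasonable substitute 
for real work.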


(2) CSC Deceptive Speech <http://catalog.ldc.upenn.edu/LDC2013S09> was 
developed by Columbia University, SRI International and the University 
of Colorado Boulder. It consists of 32 hours of audio interviews from 32 
native speakers of Standard American English (16 male, 16 female) 
recruited from the Columbia University student population and the 
community. The purpose of the study was to distinguish deceptive speech 
from non-deceptive speech using machine learning techniques on extracted 
features from the corpus.

The participants were told that they were participating in a 
communication experiment which sought to identify people who fit the 
profile of the top entrepreneurs in America. To this end, the 
participants performed tasks and answered questions in six areas. They 
were later told that they had received low scores in some of those areas 
and did not fit the profile. The subjects then participated in an 
interview where they were told to convince the interviewer that they had 
actually achieved high scores in all areas and that they did indeed fit 
the profile. The task of the interviewer was to determine how he thought 
the subjects had actually performed, and he was allowed to ask them any 
questions other than those that were part of the performed tasks. For 
each question from the interviewer, subjects were asked to indicate 
whether the reply was true or contained any false information by 
pressing one of two pedals hidden from the interviewer under a table.

Interviews were conducted in a double-walled sound booth and recorded to 
digital audio tape on two channels using Crown CM311A Differoid headworn 
close-talking microphones, then downsampled to 16 kHz before processing.

The interviews were orthographically transcribed by hand using the NIST 
EARS transcription guidelines. Labels for local lies were obtained 
automatically from the pedal-press data and hand-corrected for 
alignment, and labels for global lies were annotated during 
transcription based on the known scores of the subjects versus their 
reported scores. The orthographic transcription was force-aligned using 
the SRI telephone speech recognizer adapted for full-bandwidth 
recordings. There are several segmentations associated with the corpus: 
the implicit segmentation of the pedal presses, semi-automatically 
derived sentence-like units (EARS SLASH-UNITS or SUs) which were hand 
labeled, intonational phrase units and the units corresponding to each 
topic of the interview.
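
To make the idea of deriving local labels from pedal-press data more 
concrete, here is a small Python sketch that assigns each time-aligned 
segment the label of the pedal interval it overlaps most. Everything in 
it is an assumption for illustration: the Interval class, the 
TRUTH/LIE/UNKNOWN label names, the tuple layout and the timings are 
hypothetical, and the corpus's actual file formats and labeling 
procedure are not described in this announcement.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    """Hypothetical pedal-press interval, with times in seconds."""
    start: float
    end: float
    label: str  # e.g. "TRUTH" or "LIE" (label names are assumed)


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the overlap between two time spans (0 if none)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments, pedal_intervals, default="UNKNOWN"):
    """Give each (start, end, text) segment the label of the pedal
    interval it overlaps most; segments with no overlap get `default`."""
    labeled = []
    for seg_start, seg_end, text in segments:
        best_label, best_overlap = default, 0.0
        for pedal in pedal_intervals:
            shared = overlap(seg_start, seg_end, pedal.start, pedal.end)
            if shared > best_overlap:
                best_label, best_overlap = pedal.label, shared
        labeled.append((text, best_label))
    return labeled


# Invented timings and placeholder text, not corpus data.
pedals = [Interval(0.0, 4.0, "TRUTH"), Interval(4.0, 9.5, "LIE")]
segments = [
    (0.5, 3.2, "segment A"),
    (3.5, 8.0, "segment B"),
    (12.0, 13.0, "segment C"),
]

result = label_segments(segments, pedals)
print(result)
```

A maximum-overlap rule is only one plausible choice; segments that 
straddle a pedal change are exactly the cases where hand correction of 
the alignment, as described above, would matter.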

------------------------------------------------------------------------

-- 

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                     ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu
