[Corpora-List] New Corpora from the LDC

Thu Jul 26 17:49:03 UTC 2007

LDC2007S05
-  CSLU: Yes/No Version 1.2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S05>

LDC2007T24
-  GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T24>

LDC2007S09
-  Mandarin Affective Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S09>

------------------------------------------------------------------------

New Publications

(1)  CSLU: Yes/No Version 1.2 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S05> 
is a collection of answers to yes/no questions from various telephone 
speech corpora created by the Center for Spoken Language Understanding, 
Oregon Health and Science University (CSLU). The corpus contains 
approximately 20,000 examples of roughly 18,000 speakers saying "yes" or 
"no" in response to various questions.

Each speech file in the corpus has a corresponding orthographic 
transcription following the CSLU Labeling Conventions. In cases where a 
transcription did not already exist, the utterance was run through a 
speech recognizer to automatically obtain the transcription.

The data were collected from both analog and digital phone lines. The 
analog data were recorded using a Gradient Technologies 
analog-to-digital conversion box. These files were recorded as 16-bit 
samples at 8 kHz and stored in a linear format.
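
As a rough illustration of what the 16-bit, 8 kHz linear format implies, 
the sketch below estimates the duration of a recording from its size in 
bytes. It assumes headerless, single-channel data; the announcement does 
not state the container or channel count, and the function name is 
purely illustrative.

```python
# Minimal sketch: duration of 16-bit, 8 kHz linear PCM audio.
# ASSUMPTIONS (not stated in the announcement): the data are
# headerless and single-channel. The function name is hypothetical.

SAMPLE_RATE_HZ = 8000     # 8 kHz, from the corpus description
BYTES_PER_SAMPLE = 2      # 16-bit samples, from the corpus description


def pcm_duration_seconds(num_bytes):
    """Return the duration in seconds of num_bytes of sample data."""
    if num_bytes < 0 or num_bytes % BYTES_PER_SAMPLE != 0:
        raise ValueError("byte count must be a non-negative multiple of 2")
    return num_bytes / (BYTES_PER_SAMPLE * SAMPLE_RATE_HZ)
```

Under these assumptions, one second of speech occupies 16,000 bytes.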

(2)  GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T24> 
is the first part of the three-part GALE Phase 1 Arabic Broadcast News 
Parallel Text, which, along with other corpora, was used as training 
data in year 1 (Phase 1) of the DARPA-funded GALE program. This corpus 
contains transcripts and English translations of 17 hours of Arabic 
broadcast news programming selected from a variety of sources.  A manual 
selection procedure was used to choose data appropriate for the GALE 
program, namely, news and conversation programs focusing on current 
events. Stories on topics such as sports, entertainment news, and stock 
market reports were excluded from the data set.

The selected audio snippets were then carefully transcribed by LDC 
annotators and professional transcription agencies following LDC's Quick 
Rich Transcription specification. Manual sentence unit/segment (SU) 
annotation was also performed as part of the transcription task. Three 
types of end-of-sentence SU are identified:

    * statement SU
    * question SU
    * incomplete SU

After transcription and SU annotation, the files were reformatted into a 
human-readable translation format and were then assigned to professional 
translators for careful translation. Translators followed LDC's GALE 
translation guidelines, which describe the makeup of the translation 
team, the source data format, the translation data format, best 
practices for translating certain linguistic features (such as names and 
speech disfluencies), and quality control procedures applied to 
completed translations.

All final data are in Tab Delimited Format (TDF). TDF is compatible with 
other transcription formats, such as the Transcriber format and AG 
format, and it is easy to process.  Each line of a TDF file corresponds 
to a speech segment and contains 13 tab delimited fields.  The source 
TDF file and its translation are the same except that the transcript in 
the source TDF is replaced by its English translation. 
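
As a rough illustration of how such a line might be handled, the sketch 
below splits one TDF line on tabs and checks that it has the 13 fields 
described above. The function name is hypothetical, and the 
announcement does not name the individual fields or describe any header 
or comment lines, so none are assumed here.

```python
# Minimal sketch: split one TDF line into its tab-delimited fields.
# The field count (13) comes from the corpus description; the function
# name is hypothetical, and no field names or header lines are assumed.

EXPECTED_FIELDS = 13


def split_tdf_line(line):
    """Split a TDF line on tabs and verify that it has 13 fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(
            "expected %d fields, found %d" % (EXPECTED_FIELDS, len(fields))
        )
    return fields
```

Because a source TDF file and its translation differ only in the 
transcript field, the same check should apply to both.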

(3)  Mandarin Affective Speech 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S09> 
is a database of emotional speech consisting of audio recordings and 
corresponding transcripts collected in 2005 at the Advanced Computing and 
System Laboratory, College of Computer Science and Technology, Zhejiang 
University, Hangzhou, People's Republic of China. This corpus was 
designed with two goals: first, to serve as a tool for linguistic and 
prosodic feature investigation of emotional expression in Mandarin 
Chinese; and second, to provide a source of training and test data 
essential to support research in speaker recognition with affective 
speech. The speech database was recorded by prompting speakers to 
express different emotional states: the speakers read scenarios 
designed to elicit an emotional response.  The five emotional states 
recorded are characterized as follows:

    * Neutral - Simple statements without any emotion.
    * Anger - A strong feeling of displeasure or hostility.
    * Elation - Gladness or happiness because of praise.
    * Panic - A sudden, overpowering terror, often affecting many people
      at once.
    * Sadness - Affected or characterized by sorrow or unhappiness.

Recordings from 68 speakers (23 females, 45 males) were used in this 
corpus. Subjects were given a text to read that consisted of five 
phrases, fifteen sentences and two paragraphs designed to generate the 
emotional speech. The material included all the phonemes in Mandarin. 
Each subject read the phrases, paragraphs, and sentences portraying the 
five emotional states.  Altogether this database contains 25,636 
utterances. 

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
Philadelphia, PA 19104                      http://www.ldc.upenn.edu
