[Corpora-List] New Corpora from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Thu Jul 26 17:49:03 UTC 2007
LDC2007S05 - CSLU: Yes/No Version 1.2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S05>
LDC2007T24 - GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T24>
LDC2007S09 - Mandarin Affective Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S09>
------------------------------------------------------------------------
New Publications
(1) CSLU: Yes/No Version 1.2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S05>
is a collection of answers to yes/no questions from various telephone
speech corpora created by the Center for Spoken Language Understanding,
Oregon Health and Science University (CSLU). The corpus contains
approximately 20,000 examples of roughly 18,000 speakers saying "yes" or
"no" in response to various questions.
Each speech file in the corpus has a corresponding orthographic
transcription following the CSLU Labeling Conventions. In cases where a
transcription did not already exist, the utterance was run through a
speech recognizer to automatically obtain the transcription.
The data were collected from both analog and digital phone lines. The
analog data were recorded using a Gradient Technologies
analog-to-digital conversion box. These files were recorded as 16-bit, 8
kHz and stored in a linear format.
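As a rough illustration of what the 16-bit, 8 kHz linear format implies for storage, the sketch below estimates the duration of a recording from its size in bytes. It assumes headerless, single-channel audio; neither assumption is stated in this announcement, and the function name is hypothetical.

```python
def pcm_duration_seconds(num_bytes, sample_rate=8000, sample_width=2, channels=1):
    """Estimate the duration of raw linear PCM audio from its size in bytes.

    Defaults follow the announcement: 8 kHz sampling, 16-bit (2-byte) samples.
    The single channel and the absence of a file header are assumptions.
    """
    bytes_per_second = sample_rate * sample_width * channels
    return num_bytes / bytes_per_second


# One second of audio under these assumptions occupies 16,000 bytes.
one_second = pcm_duration_seconds(16000)
```

If the files carry a header, its size would need to be subtracted from the byte count first.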
(2) GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T24>
is the first part of the three-part GALE Phase 1 Arabic Broadcast News
Parallel Text, which, along with other corpora, was used as training
data in year 1 (Phase 1) of the DARPA-funded GALE program. This corpus
contains transcripts and English translations of 17 hours of Arabic
broadcast news programming selected from a variety of sources. A manual
selection procedure was used to choose data appropriate for the GALE
program, namely, news and conversation programs focusing on current
events. Stories on topics such as sports, entertainment news, and stock
market reports were excluded from the data set.
The selected audio snippets were then carefully transcribed by LDC
annotators and professional transcription agencies following LDC's Quick
Rich Transcription specification. Manual sentence units/segments (SU)
annotation was also performed as part of the transcription task. Three
types of end of sentence SU are identified:
* statement SU
* question SU
* incomplete SU
After transcription and SU annotation, the files were reformatted into a
human-readable translation format and were then assigned to professional
translators for careful translation. Translators followed LDC's GALE
translation guidelines, which describe the makeup of the translation
team, the source data format, the translation data format, best
practices for translating certain linguistic features (such as names and
speech disfluencies), and quality control procedures applied to
completed translations.
All final data are in Tab Delimited Format (TDF). TDF is compatible with
other transcription formats, such as the Transcriber format and AG
format, and it is easy to process. Each line of a TDF file corresponds
to a speech segment and contains 13 tab delimited fields. The source
TDF file and its translation are the same except that the transcript in
the source TDF is replaced by its English translation.
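The TDF description above can be sketched in a few lines of Python: split each line on tabs, check that it has 13 fields, and compare a source line with its translated counterpart to see which fields differ. The function names are hypothetical, the field positions are deliberately left unnamed because this announcement does not list them, and any header or comment lines a real file might contain are not handled.

```python
EXPECTED_FIELDS = 13  # per the description: 13 tab-delimited fields per segment


def split_tdf_line(line):
    """Split one TDF line into its fields, checking the field count."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(
            f"expected {EXPECTED_FIELDS} fields, found {len(fields)}"
        )
    return fields


def differing_fields(source_line, translation_line):
    """Return the indices of fields that differ between two aligned lines."""
    src = split_tdf_line(source_line)
    tgt = split_tdf_line(translation_line)
    return [i for i, (a, b) in enumerate(zip(src, tgt)) if a != b]


# Made-up placeholder lines, not real corpus data.
source = "\t".join(f"f{i}" for i in range(EXPECTED_FIELDS))
translated_fields = source.split("\t")
translated_fields[7] = "english text"  # position 7 is arbitrary here
translation = "\t".join(translated_fields)
changed = differing_fields(source, translation)
```

For a correctly aligned source/translation pair, only the transcript field should appear in the result, so a check like this can flag misaligned files.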
(3) Mandarin Affective Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S09>
is a database of emotional speech consisting of audio recordings and
corresponding transcripts collected in 2005 at the Advanced Computing and
System Laboratory, College of Computer Science and Technology, Zhejiang
University, Hangzhou, People's Republic of China. This corpus was
designed with two goals: first, to serve as a tool for linguistic and
prosodic feature investigation of emotional expression in Mandarin
Chinese; and second, to provide a source of training and test data
essential to support research in speaker recognition with affective
speech. The speech database was recorded by prompting speakers to
express different emotional states in response to stimuli. The speakers
read scenarios designed to elicit an emotional response. The five
emotional states recorded are characterized as follows:
* Neutral - Simple statements without any emotion.
* Anger - A strong feeling of displeasure or hostility.
* Elation - Gladness or happiness because of praise.
* Panic - A sudden, overpowering terror, often affecting many people
at once.
* Sadness - Affected or characterized by sorrow or unhappiness.
Recordings from 68 speakers (23 females, 45 males) were used in this
corpus. Subjects were given a text to read that consisted of five
phrases, fifteen sentences and two paragraphs designed to generate the
emotional speech. The material included all the phonemes in Mandarin.
Each subject read the phrases, paragraphs, and sentences portraying the
five emotional states. Altogether this database contains 25,636
utterances.
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium        Phone: (215) 573-1275
University of Pennsylvania        Fax: (215) 573-2175
3600 Market St., Suite 810        ldc at ldc.upenn.edu
Philadelphia, PA 19104            http://www.ldc.upenn.edu