[Corpora-List] New from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Mon Sep 24 21:29:38 UTC 2007
The Linguistic Data Consortium (LDC) is pleased to announce the
availability of three new publications.
LDC2007S13 - *CSLU: Apple Words and Phrases*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S13>
LDC2007T23 - *GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T23>
LDC2007S15 - *Nationwide Speech Project*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S15>
------------------------------------------------------------------------
*New Publications*
(1) CSLU: Apple Words and Phrases
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S13>
contains approximately 69.5 hours of speech from 3008 telephone calls
placed on analog and digital phone systems. Apple Computer, Inc.
supported the development of this data and also supplied the list of
words and phrases collected. Callers responded to questions and
repeated a list of phrases as they were prompted. Each subject called
the CSLU data collection system by dialing a toll-free number. The
analog data were collected via a Worldport Pod on an Apple Quadra A/V.
The digital data were collected with the CSLU T1 digital data collection
system.
Callers were prompted to answer questions including "What is your
native language?", "In which city and state did you spend most of your
childhood?", "What time is it now?" and "What day is today?" Callers
were also instructed to repeat various command-and-control phrases, including
"play previous message again", "make a meeting for today", "quit", "who
is at work", "what is the area code for this state", "hello, what are my
messages", "help", "please send a car from the city", "delete my email
tomorrow", "read this text", "erase all information", "record extended
phonebook", "transfer all calls to home at twelve o'clock", "record
urgent message" and "find the operator".
A human verifier listened to each recorded utterance to determine
whether the speaker had adequately followed the directions. Utterances
containing extraneous words or excessive noise were excluded from
the corpus.
***
(2) GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T23>
is the first part of the three-part GALE Phase 1 Chinese Broadcast News
Parallel Text, which, along with other corpora, was used as training
data in Year 1 (Phase 1) of the DARPA-funded GALE program. This corpus
contains transcripts and English translations of Chinese broadcast news
programming. It does not contain the audio files from which the
transcripts and translations were generated.
A total of 23.3 hours of Chinese broadcast news programming was selected
from two sources, China Central TV (CCTV) (a broadcaster from Mainland
China) and Phoenix TV (a Hong Kong-based satellite TV station). The
transcripts and translations represent recordings of five different
programs. A manual selection procedure was used to choose data
appropriate for the GALE program, namely, news programs focusing on
current events. Stories on topics such as sports, entertainment news,
and stock markets were excluded from the data set.
***
(3) The purpose of the Nationwide Speech Project
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S15>
(NSP) was to collect a large amount of speech produced by male and
female talkers representing the primary regional varieties of American
English: New England, Mid-Atlantic, North, Midland, South and West.
This release represents part of the work conducted by the authors at
Indiana University. It contains approximately 60 hours of speech, nearly
one hour of speech from each of 60 white American English speakers
(five male and five female talkers from each of the six dialect
regions) reading words and sentences. The corpus can be used for
perceptual and acoustic experiments designed to explore the role of
variation in spoken language processing. Such applications include
speech science experiments and sociolinguistic or sociophonetic research.
The speakers were recruited from the Indiana University community; they
were all 18-25 years old at the time of recording, had lived exclusively
in one region prior to age 18, and both parents of each speaker were
also raised in the same region. Further demographic information about
the speakers is provided. The materials include 102 high predictability
sentences and five repetitions of each of 10 hVd words. The high
predictability sentences are 5-8 words in length and the final word in
each sentence is highly predictable based on the preceding semantic
context. The 10 hVd words are: heed, hid, hayed, head, had, hod, hud,
hoes, hood and who'd.
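
For readers planning experiments, the balanced design described above can be sketched in a few lines of Python. This is only an illustration built from the figures in this announcement: the function names and the (region, gender, index) speaker tuples are hypothetical and do not reflect the corpus's actual file or speaker naming, and the per-speaker item count assumes every talker read exactly the materials listed here.

```python
from itertools import product

# Figures taken from the announcement text.
REGIONS = ["New England", "Mid-Atlantic", "North", "Midland", "South", "West"]
GENDERS = ["male", "female"]
TALKERS_PER_CELL = 5          # five male and five female talkers per region
HP_SENTENCES = 102            # high predictability sentences
HVD_REPETITIONS = 5           # repetitions of each hVd word
HVD_WORDS = ["heed", "hid", "hayed", "head", "had",
             "hod", "hud", "hoes", "hood", "who'd"]


def speaker_grid():
    """Enumerate hypothetical (region, gender, index) speaker slots."""
    return [
        (region, gender, index)
        for region, gender in product(REGIONS, GENDERS)
        for index in range(1, TALKERS_PER_CELL + 1)
    ]


def items_per_speaker():
    """Count sentence and word tokens listed for one talker (assumed complete)."""
    return HP_SENTENCES + HVD_REPETITIONS * len(HVD_WORDS)


if __name__ == "__main__":
    print(len(speaker_grid()))    # 6 regions x 2 genders x 5 talkers
    print(items_per_speaker())    # 102 sentences + 5 x 10 hVd tokens
```

The grid yields 60 speaker slots, matching the 60 speakers reported, and 152 listed items per talker under the stated assumption.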
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium        Phone: (215) 573-1275
University of Pennsylvania        Fax: (215) 573-2175
3600 Market St., Suite 810        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA        http://www.ldc.upenn.edu