[Corpora-List] New from the LDC

Mon Sep 24 21:29:38 UTC 2007

The Linguistic Data Consortium (LDC) is pleased to announce the 
availability of three new publications.

LDC2007S13
-  *CSLU: Apple Words and Phrases 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S13>*  -

LDC2007T23
-  *GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T23>*  -

LDC2007S15
-  *Nationwide Speech Project 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S15>*  -

------------------------------------------------------------------------

*New Publications*

(1)  CSLU:  Apple Words and Phrases 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S13> 
contains approximately 69.5 hours of speech from 3008 telephone calls 
placed on analog and digital phone systems.  Apple Computer, Inc. 
supported the development of this data and also supplied the list of 
words and phrases collected.  Callers responded to questions and 
repeated a list of phrases as they were prompted.  Each subject called 
the CSLU data collection system by dialing a toll-free number.  The 
analog data were collected via a Worldport Pod on an Apple Quadra A/V.  
The digital data were collected with the CSLU T1 digital data collection 
system.

Callers were prompted to answer questions including "What is your 
native language?", "In which city and state did you spend most of your 
childhood?", "What time is it now?" and "What day is today?".  Callers 
were also instructed to repeat various command-and-control phrases, including 
"play previous message again", "make a meeting for today", "quit", "who 
is at work", "what is the area code for this state", "hello, what are my 
messages", "help", "please send a car from the city", "delete my email 
tomorrow", "read this text", "erase all information", "record extended 
phonebook", "transfer all calls to home at twelve o'clock", "record 
urgent message" and "find the operator".

Each recorded utterance was listened to by a human verifier to determine 
if the speaker adequately followed the directions.  If an utterance 
contained extraneous words or excessive noise, it was not included in 
the corpus. 

***

(2)  GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T23> 
is the first part of the three-part GALE Phase 1 Chinese Broadcast News 
Parallel Text, which, along with other corpora, was used as training 
data in Year 1 (Phase 1) of the DARPA-funded GALE program.  This corpus 
contains transcripts and English translations of Chinese broadcast news 
programming.  It does not contain the audio files from which the 
transcripts and translations were generated.

A total of 23.3 hours of Chinese broadcast news programming was selected 
from two sources: China Central TV (CCTV), a broadcaster from Mainland 
China, and Phoenix TV, a Hong Kong-based satellite TV station.  The 
transcripts and translations represent recordings of five different 
programs.  A manual selection procedure was used to choose data 
appropriate for the GALE program, namely, news programs focusing on 
current events.  Stories on topics such as sports, entertainment news, 
and stock markets were excluded from the data set.

***

(3) The purpose of the Nationwide Speech Project 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S15> 
(NSP) was to collect a large amount of speech produced by male and 
female talkers representing the primary regional varieties of American 
English: New England, Mid-Atlantic, North, Midland, South and West.  
This release represents part of the work conducted by the authors at 
Indiana University.  It contains approximately 60 hours of speech: nearly 
one hour from each of 60 white American English speakers (five male and 
five female talkers from each of the six dialect regions) reading words 
and sentences.  The corpus can be used for 
perceptual and acoustic experiments designed to explore the role of 
variation in spoken language processing.  Such applications include 
speech science experiments and sociolinguistic or sociophonetic research.

The speakers were recruited from the Indiana University community.  They 
were all 18-25 years old at the time of recording and had lived exclusively 
in one region prior to age 18; both parents of each speaker were 
also raised in that region.  Further demographic information about 
the speakers is provided.  The materials include 102 high-predictability 
sentences and five repetitions of each of 10 hVd words.  The 
high-predictability sentences are 5-8 words in length, and the final word in 
each sentence is highly predictable based on the preceding semantic 
context.  The 10 hVd words are: heed, hid, hayed, head, had, hod, hud, 
hoed, hood and who'd.

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                  http://www.ldc.upenn.edu
