[Corpora-List] New Corpora from the LDC

LDC Office ldc at ldc.upenn.edu
Mon Jan 6 16:35:01 UTC 2003


The Linguistic Data Consortium (LDC) is pleased to announce the
availability of three new corpora.


	     **  1997 HUB5 Spanish Evaluation  **

	     **  2000 Communicator Evaluation  **

    **  Grassfields Bantu Fieldwork: Ngomba Tone Paradigms  **


1.  The 1997 Hub-5 Spanish evaluation is part of an ongoing series
of periodic evaluations conducted by NIST.  This evaluation focused
on the task of transcribing conversational speech into text. Each
conversation is represented as a "4-wire" recording, that is, with
two distinct sides, one from each end of the telephone circuit. Each
side is recorded and stored as a standard telephone codec signal
(8 kHz sampling, 8-bit mu-law encoding).  The 1997 HUB5 Spanish
Evaluation contains 426 Mbytes of SPHERE data.

For further information, including a link to additional documentation on
the NIST web site, please visit:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002S25

Institutions that have membership in the LDC during the 2002
Membership Year will be able to receive this corpus free of charge.
Nonmembers may purchase this publication for $1000.


2.  The original goals of the Communicator program were to support the
creation of speech-enabled interfaces that scale gracefully across
modalities, from speech-only to interfaces that include graphics,
maps, pointing and gesture. The original vision of the Communicator
systems included the ability of a user, during one ten-minute session,
to plan a three-leg trip, with the three flights/legs on three different
days, with rental car and hotel in each of the two "away" cities, plus
dictating/sending a voice-mail message.

The actual research that led to the data collections in 2000 and 2001
explored ways to construct better spoken-dialogue systems, with which
users interact via speech alone to perform relatively complex tasks such
as travel planning. During 2000 and 2001 two large data sets were
collected, in which users used the Communicator systems built by the
research groups to do travel planning. The 2000 Communicator Evaluation
publication consists of all the data from the 2000 collection.

For the 2000 evaluation, each user called the nine different automated
travel-planning systems to make simulated flight reservations. All audio
files are in SPHERE format, recorded in 8-bit mu-law and PCM at 8 kHz.
The two-channel SPHERE files total ~62 hours of audio (3415 MB),
representing ~317K words in transcription.
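
As a rough sanity check on the figures above, the size and duration can
be related through the codec parameters. The sketch below is only an
estimate under stated assumptions: it treats every sample as 8-bit
(one byte) even though some files are PCM, it ignores SPHERE header
bytes, and the function name estimated_hours is hypothetical, not part
of any LDC or NIST tool. It also shows how the result shifts depending
on whether "MB" is read as 2^20 or 10^6 bytes.

```python
def estimated_hours(total_bytes, sample_rate_hz=8000,
                    bytes_per_sample=1, channels=2):
    """Estimate audio duration in hours from raw data size.

    Assumption: uniform encoding and no header overhead, so this is
    only an approximation of the real corpus duration.
    """
    bytes_per_second = sample_rate_hz * bytes_per_sample * channels
    return total_bytes / bytes_per_second / 3600.0


size_mb = 3415  # size quoted in the announcement

# Reading "MB" as 2**20 bytes.
hours_binary = estimated_hours(size_mb * 1024 * 1024)

# Reading "MB" as 10**6 bytes.
hours_decimal = estimated_hours(size_mb * 1000 * 1000)

print(f"binary MB:  {hours_binary:.1f} hours")
print(f"decimal MB: {hours_decimal:.1f} hours")
```

Under these assumptions both readings land near the ~62 hours quoted
above, which suggests the size and duration figures are mutually
consistent; it does not say which byte convention was actually used.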

Institutions that have membership in the LDC during the 2002
Membership Year will be able to receive this corpus free of charge.
Nonmembers may purchase this publication for $900.


3.  Grassfields Bantu Fieldwork: Ngomba Tone Paradigms contains tone
paradigms of the language Ngomba, a Bamileke (Grassfields Bantu)
language spoken by some 63,000 people in the Western Province of
Cameroon. Ngomba's tone system is undescribed, but it has many
similarities with the closely related Yémba language (also known as
Bamileke Dschang).

This publication contains 755 audio files. The files in rawdata are 21
extended audio and laryngograph recordings with ESPS xlabel files; each
one of the raw sound files contains the complete recording of one of the
tenses. Transcriptions are provided for the audio clips using the
IPA-based orthography, and using phonetic and tonological transcription
systems.  The verbal tone paradigms are also accessible over the
internet, along with an interface for browsing and editing
transcriptions, at http://www.ldc.upenn.edu/Projects/grassfields

For further information, please visit:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001S16

This publication is free of charge to 2001 and 2002 members.  The cost
of the first 100 copies of this publication (not counting the copies
distributed to LDC members) is covered by NSF Grant Number 9983258.
These copies are, therefore, free of charge to qualified researchers;
a $30 shipping and handling fee applies. After these first 100 copies
are distributed, additional copies will be available for the production
cost of $150 per CD-ROM.


 			  **


If you need additional information before placing your order, or
would like to inquire about membership in the LDC, please send email to
<ldc at ldc.upenn.edu> or call (215) 573-1275.


---------------------------------------------------------------------
Linguistic Data Consortium          Phone: (215) 573-1275
3600 Market Street                  Fax:   (215) 573-2175
Suite 810                           email: ldc at unagi.cis.upenn.edu
Philadelphia, PA 19104-2653         www: http://www.ldc.upenn.edu


