[Corpora-List] News from the LDC

Wed Jun 28 20:49:11 UTC 2006

LDC2006S35
CSLU: Multilanguage Telephone Speech Version 1.2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S35>

LDC2006S31
NIST 2003 Language Recognition Evaluation
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S31>

LDC2006T12
Spanish Gigaword First Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T12>

The Linguistic Data Consortium (LDC) would like to announce the 
availability of three new publications.

------------------------------------------------------------------------

(1) The CSLU: Multilanguage Telephone Speech Version 1.2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S35>
corpus consists of telephone speech from eleven languages: English,
Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish,
Tamil, and Vietnamese. The corpus contains fixed-vocabulary utterances
(e.g., days of the week) as well as fluent continuous speech. The current
release includes recorded utterances from about 2052 speakers, for a
total of about 38.5 hours of speech. Time-aligned phonetic
transcriptions for 619 of the utterances are also included. For the
data collection, the sampling rate was 8 kHz and the files were stored in
16-bit linear format on a UNIX file system. Each utterance was recorded
as a separate file. 
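
As a rough illustration of what these figures imply for storage, the
sketch below estimates the raw sample payload for 38.5 hours of 8 kHz,
16-bit linear audio. The function name is hypothetical, a single channel
is assumed, and any per-file header overhead in the actual release is
ignored, so the result is only an approximation.

```python
def raw_pcm_bytes(hours, sample_rate_hz=8000, bytes_per_sample=2):
    """Estimate the raw payload size of single-channel linear PCM audio.

    Assumes one channel and no file headers (illustrative only).
    """
    seconds = hours * 3600
    return int(round(seconds * sample_rate_hz * bytes_per_sample))


if __name__ == "__main__":
    total = raw_pcm_bytes(38.5)
    print(f"{total} bytes (~{total / 1e9:.2f} GB)")
```

Under these assumptions the audio payload comes to roughly 2.2 GB.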

(2) The goal of the NIST Language Recognition Evaluation (LRE) is to 
establish the baseline of current performance capability for language 
recognition of conversational telephone speech and to lay the groundwork 
for further research efforts in the field. The series had its first 
evaluation in 1996. The 2003 NIST Language Recognition Evaluation 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S31> 
(LRE-03) was part of this ongoing series of evaluations of language 
recognition technology.  The task evaluated was the detection of a given 
target language. Given a test segment of speech, a target language was 
assigned as a test hypothesis, and the task was to determine whether 
this test hypothesis was true or false.

Each speech file is one side of a "4-wire" telephone conversation
represented as 8-bit, 8 kHz mu-law data. There are 7990 speech files in
SPHERE (.sph) format for a total of around six hours of speech. The
speech data was compiled from the LDC's CALLFRIEND, CALLHOME, and 
SWITCHBOARD-2 corpora.
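
For readers who want to work with the samples as linear PCM, here is a
minimal sketch of mu-law expansion, assuming the data follows the usual
G.711 mu-law convention. The function names are hypothetical, and the
sketch operates on raw sample bytes only; locating the samples inside a
SPHERE file (header parsing) is not shown.

```python
def mulaw_to_linear(byte):
    """Expand one 8-bit G.711 mu-law code to a signed 16-bit-range sample."""
    u = ~byte & 0xFF                      # mu-law codes are stored inverted
    # Rebuild the magnitude from the 4-bit mantissa and 3-bit exponent,
    # using the standard bias of 0x84 (132).
    magnitude = (((u & 0x0F) << 3) + 0x84) << ((u & 0x70) >> 4)
    # The top bit of the inverted code selects the sign.
    return 0x84 - magnitude if u & 0x80 else magnitude - 0x84


def decode_mulaw(data):
    """Decode a bytes object of mu-law codes into a list of linear samples."""
    return [mulaw_to_linear(b) for b in data]


if __name__ == "__main__":
    print(decode_mulaw(bytes([0xFF, 0x7F, 0x00, 0x80])))
```

At 8 kHz with one byte per sample, each second of speech in this format
occupies 8000 bytes before any header.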

(3) The Spanish Gigaword First Edition 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T12> 
is a comprehensive archive of newswire text data that has been acquired 
over several years by the Linguistic Data Consortium; some of the data 
included has been released previously in other LDC corpora.

The three distinct international sources of Spanish newswire in this 
edition, and the time spans of collection covered for each, are as follows:

    * Agence France-Presse, Spanish Service, May 1994 - Dec 2005
    * Associated Press Worldstream, Spanish, Nov 1993 - Dec 2005
    * Xinhua News Agency, Spanish Service, Sep 2001 - Dec 2005

------------------------------------------------------------------------

If you need further information, or would like to inquire about
membership in the LDC, please email ldc at ldc.upenn.edu or call
+1 215 573 1275.

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                  http://www.ldc.upenn.edu
