[Corpora-List] New from the LDC

Fri Aug 24 17:10:51 UTC 2007

LDC2007S10
-  2003 NIST Rich Transcription Evaluation Data
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S10>

LDC2007T38
-  Chinese Gigaword Third Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T38>

The Linguistic Data Consortium (LDC) is pleased to announce the 
availability of two new publications.

------------------------------------------------------------------------

New Publications

(1)  2003 NIST Rich Transcription Evaluation Data 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S10> 
[note: this link takes you to ARL Urdu Speech 2007S03] contains the test 
material used in the 2003 Rich Transcription Spring and Fall evaluations 
administered by the NIST (National Institute of Standards and 
Technology) Speech Group <http://www.nist.gov/speech>. The Spring 
evaluation (RT-03S) focused on Speech-To-Text (STT) tasks for broadcast 
news speech and conversational telephone speech in three languages: 
English, Mandarin Chinese and Arabic. That evaluation also included one 
Metadata Extraction (MDE) task, speaker diarization for broadcast news 
speech and conversational telephone speech in English. The Fall 
evaluation (RT-03F) focused on MDE tasks including speaker diarization, 
speaker-attributed STT, SU (sentence/semantic unit) detection and 
disfluency detection for broadcast news speech and conversational 
telephone speech in English. For complete information about the 
evaluations, see the RT-03 Spring Evaluation Website 
<http://www.nist.gov/speech/tests/rt/rt2003/spring> and the RT-03 Fall 
Evaluation Website 
<http://www.nist.gov/speech/tests/rt/rt2003/fall/index.htm>.

The English Broadcast News (BN) dataset is approximately three hours 
long and composed of 30-minute excerpts from six different broadcasts.  
The Mandarin Chinese BN dataset is approximately one hour long and 
composed of 12-minute excerpts from five different broadcasts.  The 
Arabic BN dataset is also approximately one hour long; it is composed of 
30-minute excerpts from two different broadcasts.  For all BN datasets, 
the broadcasts were selected from TDT-4 sources and the evaluation 
excerpts were transcribed to the nearest story boundary.

The English Conversational Telephone Speech (CTS) dataset is 
approximately 6 hours long. It is composed of 5-minute excerpts from 72 
different conversations: 36 from the Switchboard Cellular 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001S13> 
collection and 36 from the Fisher collection 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S13>. 
The Mandarin Chinese CTS dataset is approximately one hour long and 
composed of 5-minute excerpts from 12 different conversations from the 
CallFriend Mandarin Chinese data 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96S55>.  
The Arabic CTS set is also approximately one hour long. It is composed 
of 5-minute excerpts from 12 different conversations from the CallHome 
Egyptian Arabic data 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S45>.  
For all CTS datasets, the evaluation excerpts were transcribed to the 
nearest turn.
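
The approximate durations quoted above follow from the excerpt lengths 
and counts. As a minimal sketch (the names DATASETS, total_hours and 
summarize are illustrative and not part of the release), the nominal 
totals can be checked as follows. Because excerpts were transcribed to 
the nearest story boundary or turn, actual durations will differ 
slightly from these figures.

```python
# (dataset label, nominal excerpt length in minutes, number of excerpts),
# taken from the figures stated in the announcement.
DATASETS = [
    ("English BN", 30, 6),
    ("Mandarin Chinese BN", 12, 5),
    ("Arabic BN", 30, 2),
    ("English CTS", 5, 72),
    ("Mandarin Chinese CTS", 5, 12),
    ("Arabic CTS", 5, 12),
]


def total_hours(minutes_per_excerpt, n_excerpts):
    """Nominal total duration in hours for equally sized excerpts."""
    return minutes_per_excerpt * n_excerpts / 60


def summarize(datasets=DATASETS):
    """Map each dataset label to its nominal duration in hours."""
    return {name: total_hours(m, n) for name, m, n in datasets}


if __name__ == "__main__":
    for name, hours in summarize().items():
        print(f"{name}: about {hours:g} h")
```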

No manual (human-annotated) segmentations are provided. Sites were 
required to generate their own segmentations automatically.  Unlike the 
BN audio files where the full broadcasts were provided, the CTS audio 
files contain only the evaluation excerpts.

***

(2) Chinese Gigaword Third Edition 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T38> 
is a comprehensive archive of newswire text data that has been acquired 
over several years by the LDC. This edition includes all of the contents 
in Chinese Gigaword Second Edition (LDC2005T14) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14> 
as well as new data collected after the publication of that edition. 
Also, an archive of articles from a new newswire source (Agence France 
Presse) has been added in the third edition.

The four distinct international sources of Chinese newswire included in 
this edition are the following:

    * Agence France Presse (afp_cmn)
    * Central News Agency, Taiwan (cna_cmn)
    * Xinhua News Agency (xin_cmn)
    * Zaobao Newspaper (zbn_cmn)

All text files in this corpus have been converted to UTF-8 character 
encoding.

New in the Third Edition:

    * Over six years' worth of articles (October 2000 through December
      2006) from Agence France Presse are being released for the first
      time.
    * Two years' worth of new articles (January 2005 through December
      2006) have been added to the Xinhua data set.
    * Nearly two years' worth of content was added to the CNA data set.
    * A small set of older stories (October through December 1998) has
      been added from Zaobao; these were previously published by LDC as
      part of TDT3 Multilanguage Text Version 2.0 (LDC2001T58)
      <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001T58>
      and are being included in Gigaword for the first time.
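
The coverage periods listed above can be turned into month counts. The 
following is a small sketch only: the helper months_inclusive and the 
table NEW_SPANS are hypothetical names, and the CNA addition is left 
out because the announcement gives no exact dates for it.

```python
def months_inclusive(start_year, start_month, end_year, end_month):
    """Number of calendar months from start to end, counting both ends."""
    return (end_year - start_year) * 12 + (end_month - start_month) + 1


# Source ID -> (start year, start month, end year, end month) of the
# newly added material, as stated in the announcement.
NEW_SPANS = {
    "afp_cmn": (2000, 10, 2006, 12),  # October 2000 through December 2006
    "xin_cmn": (2005, 1, 2006, 12),   # January 2005 through December 2006
    "zbn_cmn": (1998, 10, 1998, 12),  # October through December 1998
}

if __name__ == "__main__":
    for source, span in NEW_SPANS.items():
        months = months_inclusive(*span)
        years, rest = divmod(months, 12)
        print(f"{source}: {months} months ({years} y {rest} m)")
```

Under these assumptions the AFP span comes to 75 months, consistent 
with "over six years", and the Xinhua addition to 24 months.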

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                  http://www.ldc.upenn.edu

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora