[Corpora-List] New from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Fri Aug 24 17:10:51 UTC 2007
LDC2007S10
- 2003 NIST Rich Transcription Evaluation Data
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S10>
LDC2007T38
- Chinese Gigaword Third Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T38>
The Linguistic Data Consortium (LDC) is pleased to announce the
availability of two new publications.
------------------------------------------------------------------------
New Publications
(1) 2003 NIST Rich Transcription Evaluation Data
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S10>
[note: this link takes you to ARL Urdu Speech 2007S03] contains the test
material used in the 2003 Rich Transcription Spring and Fall evaluations
administered by the NIST (National Institute of Standards and
Technology) Speech Group <http://www.nist.gov/speech>. The Spring
evaluation (RT-03S) focused on Speech-To-Text (STT) tasks for broadcast
news speech and conversational telephone speech in three languages:
English, Mandarin Chinese and Arabic. That evaluation also included one
Metadata Extraction (MDE) task, speaker diarization for broadcast news
speech and conversational telephone speech in English. The Fall
evaluation (RT-03F) focused on MDE tasks including speaker diarization,
speaker-attributed STT, SU (sentence/semantic unit) detection and
disfluency detection for broadcast news speech and conversational
telephone speech in English. For complete information about the
evaluations, see the RT-03 Spring Evaluation Website
<http://www.nist.gov/speech/tests/rt/rt2003/spring> and the RT-03 Fall
Evaluation Website
<http://www.nist.gov/speech/tests/rt/rt2003/fall/index.htm>.
The English Broadcast News (BN) dataset is approximately three hours
long and composed of 30-minute excerpts from six different broadcasts.
The Mandarin Chinese BN dataset is approximately one hour long and
composed of 12-minute excerpts from five different broadcasts. The
Arabic BN dataset is also approximately one hour long; it is composed of
30-minute excerpts from two different broadcasts. For all BN datasets,
the broadcasts were selected from TDT-4 sources and the evaluation
excerpts were transcribed to the nearest story boundary.
The English Conversational Telephone Speech (CTS) dataset is
approximately six hours long. It is composed of 5-minute excerpts from 72
different conversations: 36 from the Switchboard Cellular
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001S13>
collection and 36 from the Fisher collection
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S13>.
The Mandarin Chinese CTS dataset is approximately one hour long and
composed of 5-minute excerpts from 12 different conversations from the
CallFriend Mandarin Chinese data
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96S55>.
The Arabic CTS set is also approximately one hour long. It is composed
of 5-minute excerpts from 12 different conversations from the CallHome
Egyptian Arabic data
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S45>.
For all CTS datasets, the evaluation excerpts were transcribed to the
nearest turn.
No manual (human-annotated) segmentations are provided. Sites were
required to generate their own segmentations automatically. Unlike the
BN audio files, for which the full broadcasts are provided, the CTS audio
files contain only the evaluation excerpts.
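As a quick sanity check on the dataset sizes described above, the short
Python sketch below multiplies each excerpt length by the number of
excerpts and compares the result with the stated approximate duration.
The figures are taken from the descriptions above; the names DATASETS
and total_hours are illustrative only and are not part of the corpus or
of any LDC tooling.

```python
# Excerpt figures copied from the announcement text.
# Each entry: (label, excerpt length in minutes, number of excerpts,
#              stated approximate total in hours)
# DATASETS and total_hours are hypothetical names used for illustration.
DATASETS = [
    ("English BN", 30, 6, 3),
    ("Mandarin Chinese BN", 12, 5, 1),
    ("Arabic BN", 30, 2, 1),
    ("English CTS", 5, 72, 6),
    ("Mandarin Chinese CTS", 5, 12, 1),
    ("Arabic CTS", 5, 12, 1),
]


def total_hours(excerpt_minutes, excerpt_count):
    """Return the nominal total duration in hours."""
    return excerpt_minutes * excerpt_count / 60


for label, minutes, count, stated in DATASETS:
    computed = total_hours(minutes, count)
    print(f"{label}: {computed:g} h computed, approx. {stated} h stated")
```

These are nominal figures only. Because excerpts were transcribed to the
nearest story boundary (BN) or turn (CTS), the actual durations may
differ slightly from the computed values.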
(2) Chinese Gigaword Third Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T38>
is a comprehensive archive of newswire text data that has been acquired
over several years by the LDC. This edition includes all of the contents
in Chinese Gigaword Second Edition (LDC2005T14)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14>
as well as new data collected after the publication of that edition.
Also, an archive of articles from a new newswire source (Agence France
Presse) has been added in the third edition.
The four distinct international sources of Chinese newswire included in
this edition are the following:
* Agence France Presse (afp_cmn)
* Central News Agency, Taiwan (cna_cmn)
* Xinhua News Agency (xin_cmn)
* Zaobao Newspaper (zbn_cmn)
All text files in this corpus have been converted to UTF-8 character
encoding.
New in the Third Edition:
* Over six years' worth of articles (October 2000 through December
  2006) from Agence France Presse are being released for the first
  time.
* Two years' worth of new articles (January 2005 through December
  2006) have been added to the Xinhua data set.
* Nearly two years' worth of content has been added to the CNA data set.
* A small set of older stories (October through December 1998) has
  been added from Zaobao; these were previously published by LDC as
  part of TDT3 Multilanguage Text Version 2.0 (LDC2001T58)
  <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001T58>
  and are being included in Gigaword for the first time.
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium      Phone: (215) 573-1275
University of Pennsylvania      Fax: (215) 573-2175
3600 Market St., Suite 810      ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA      http://www.ldc.upenn.edu