[Corpora-List] New from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Fri Oct 26 16:23:16 UTC 2007
LDC2007S12
*2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S12>
LDC2007T19
*MITRE 1997 Mandarin Broadcast News Speech Translations (Hub-4NE)*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T19>
*The Linguistic Data Consortium (LDC) is pleased to announce the
availability of two new publications.*
------------------------------------------------------------------------
*New Publications*
(1) 2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S12>
contains the test material (meeting speech and reference transcripts)
used in the RT-04S evaluation administered by the NIST (National
Institute of Standards and Technology) Speech Group
<http://www.nist.gov/speech>. Rich Transcription (RT) is broadly defined
as a fusion of speech-to-text technology and metadata extraction
technologies designed to provide the basis for a generation of more
usable transcriptions of human-human meeting speech.
The data in this release consists of portions of meeting speech
collected and/or transcribed by the International Computer Science
Institute (ICSI) at Berkeley, the Interactive Systems Laboratories (ISL)
at Carnegie Mellon University, NIST and LDC. The complete meeting speech
and corresponding transcript data sets are available from LDC's catalog
as follows: ICSI Meeting Speech (LDC2004S02)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S02>,
ICSI Meeting Transcripts (LDC2004T04)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T04>,
ISL Meeting Speech Part 1 (LDC2004S05)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S05>,
ISL Meeting Transcripts Part 1 (LDC2004T10)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T10>,
NIST Meeting Pilot Corpus Speech (LDC2004S09)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S09>
and NIST Meeting Pilot Corpus Transcripts and Metadata (LDC2004T13)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T13>.
RT-04S included the following tasks in the meeting domain:
*Speech-to-Text Transcription (STT) tasks*
*Microphone conditions:*
* Multiple distant microphones
* Single distant microphone
* Individual head microphone
*Processing time conditions:*
* Unlimited time STT
* Less than or equal to twenty times realtime
* Less than or equal to ten times realtime
* Less than or equal to one times realtime
*Diarization (SPKR) task ("who spoke when")*
*Microphone conditions:*
* Multiple distant microphones
* Single distant microphone
*Input conditions:*
* Speech input only
* Speech plus reference transcript input
*Processing time conditions:*
* Unlimited time
* Less than or equal to twenty times realtime
* Less than or equal to ten times realtime
* Less than or equal to one times realtime
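As a rough illustration of the processing time conditions listed above, the
sketch below maps a system's run time to the tightest condition it would
satisfy. It assumes the realtime factor is total processing time divided by
the duration of the audio; that definition, the function name and the returned
labels are illustrative assumptions and are not taken from the RT-04S
evaluation plan.

```python
# Sketch only: the thresholds (1x, 10x, 20x) come from the list above.
# The realtime-factor definition and all names here are assumptions.
def processing_time_condition(processing_seconds: float, audio_seconds: float) -> str:
    """Return the tightest processing time condition met by a run."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    # Assumed definition: processing time relative to audio duration.
    factor = processing_seconds / audio_seconds
    for limit in (1, 10, 20):
        if factor <= limit:
            return f"<= {limit}x realtime"
    # Anything slower than 20x falls under the unlimited time condition.
    return "unlimited time"
```

For example, a system that takes 9000 seconds to process a 600-second meeting
excerpt runs at 15 times realtime under this definition, so it would fall
under the twenty times realtime condition.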
Further information about the evaluation is available on the RT-04
Spring Evaluation Website
<http://www.nist.gov/speech/tests/rt/rt2004/spring/>.
(2) MITRE 1997 Mandarin Broadcast News Transcripts Translations
(Hub-4NE)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T19>
was developed by The MITRE Corporation and contains segment-aligned
English translations of the 1997 DARPA HUB4-NE Mandarin transcripts. The
original transcripts and the corresponding broadcast news audio are
available as separate LDC publications, 1997 Mandarin Broadcast News
Transcripts (HUB4-NE) (LDC98T24)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98T24>
and 1997 Mandarin Broadcast News Speech (HUB4-NE) (LDC98S73)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98S73>.
The source data comprises 30 hours of recorded Mandarin broadcasts
collected by the LDC in 1997 from Voice of America, China Central TV and
KAZN-AM, a commercial radio station based in Los Angeles, CA. The
original transcript segmentation is suitable for speech recognition, but
does not support machine translation and machine translation evaluation.
Therefore, the Mandarin side of these aligned transcripts was
resegmented for this release; in all other respects, the Mandarin
transcripts in this publication are identical to the original transcripts.
The dataset in this release consists of 376K words of English text and
517K characters of Mandarin text. The English text was produced by
translators with no access to the original audio. The translators were
given specific guidelines for translation, and those are included in
this distribution. A portion of the source data (6%) was translated four
times in order to support experiments in translation evaluation.
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
*Linguistic Data Consortium          Phone: (215) 573-1275
University of Pennsylvania          Fax: (215) 573-2175
3600 Market St., Suite 810          ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA          http://www.ldc.upenn.edu*