[Corpora-List] New from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Fri Oct 26 16:23:16 UTC 2007


LDC2007S12
*2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data* 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S12>

LDC2007T19
*MITRE 1997 Mandarin Broadcast News Speech Translations (Hub-4NE)* 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T19>

*The Linguistic Data Consortium (LDC) is pleased to announce the 
availability of two new publications.*

------------------------------------------------------------------------

*New Publications*

(1)  2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S12> 
contains the test material (meeting speech and reference transcripts) 
used in the RT-04S evaluation administered by the NIST (National 
Institute of Standards and Technology) Speech Group 
<http://www.nist.gov/speech>. Rich Transcription (RT) is broadly defined 
as a fusion of speech-to-text technology and metadata extraction 
technologies designed to provide the basis for a generation of more 
usable transcriptions of human-human meeting speech.

The data in this release consists of portions of meeting speech 
collected and/or transcribed by the International Computer Science 
Institute (ICSI) at Berkeley, the Interactive Systems Laboratories (ISL) 
at Carnegie Mellon University, NIST and LDC. The complete meeting speech 
and corresponding transcript data sets are available from LDC's catalog 
as follows: ICSI Meeting Speech (LDC2004S02) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S02>, 
ICSI Meeting Transcripts (LDC2004T04) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T04>, 
ISL Meeting Speech Part 1 (LDC2004S05) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S05>, 
ISL Meeting Transcripts Part 1 (LDC2004T10) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T10>, 
NIST Meeting Pilot Corpus Speech (LDC2004S09) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S09> 
and NIST Meeting Pilot Corpus Transcripts and Metadata (LDC2004T13) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T13>.

RT-04S included the following tasks in the meeting domain:

*Speech-to-Text Transcription (STT) tasks*
    *Microphone conditions:*

        * Multiple distant microphones
        * Single distant microphone
        * Individual head microphone

    *Processing time conditions:*

        * Unlimited time STT
        * Less than or equal to twenty times realtime
        * Less than or equal to ten times realtime
        * Less than or equal to one times realtime

*Diarization (SPKR) task ("who spoke when?")*
    *Microphone conditions:*

        * Multiple distant microphones
        * Single distant microphone

    *Input conditions:*

        * Speech input only
        * Speech plus reference transcript input

    *Processing time conditions:*

        * Unlimited time
        * Less than or equal to twenty times realtime
        * Less than or equal to ten times realtime
        * Less than or equal to one times realtime
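
The processing time conditions listed above are stated as multiples of 
realtime. As a rough illustration only, and not taken from the NIST 
evaluation plan, the sketch below assumes the realtime factor is 
processing time divided by audio duration and maps it to the tightest 
listed condition. The function names and labels are hypothetical.

```python
# Illustrative sketch only: names and labels are assumptions,
# not part of any NIST scoring tool.

# Limits from the announcement, tightest first.
LIMITS = [(1.0, "<=1xRT"), (10.0, "<=10xRT"), (20.0, "<=20xRT")]


def realtime_factor(processing_seconds, audio_seconds):
    """Assumed definition: processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def time_condition(processing_seconds, audio_seconds):
    """Return the tightest listed condition the run satisfies."""
    rtf = realtime_factor(processing_seconds, audio_seconds)
    for limit, label in LIMITS:
        if rtf <= limit:
            return label
    return "unlimited"
```

For example, a system that spends 90 minutes processing a 10-minute 
meeting excerpt has a realtime factor of 9 under this definition, so it 
would fall under the ten times realtime condition.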

Further information about the evaluation is available on the RT-04 
Spring Evaluation Website 
<http://www.nist.gov/speech/tests/rt/rt2004/spring/>. 

(2)  MITRE 1997 Mandarin Broadcast News Speech Translations 
(Hub-4NE) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T19> 
was developed by The MITRE Corporation and contains segment-aligned 
English translations of the 1997 DARPA HUB4-NE Mandarin transcripts. The 
original transcripts and the corresponding broadcast news audio are 
available as separate LDC publications, 1997 Mandarin Broadcast News 
Transcripts (HUB4-NE) (LDC98T24) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98T24> 
and 1997 Mandarin Broadcast News Speech (HUB4-NE) (LDC98S73) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98S73>.

The source data consists of 30 hours of recorded Mandarin broadcasts 
collected by the LDC in 1997 from Voice of America, China Central TV and 
KAZN-AM, a commercial radio station based in Los Angeles, CA. The 
original transcript segmentation is suitable for speech recognition, but 
does not support machine translation and machine translation evaluation. 
Therefore, the Mandarin side of these aligned transcripts was 
resegmented for this release; in all other respects, the Mandarin 
transcripts in this publication are identical to the original transcripts.

The dataset in this release consists of 376K words of English text and 
517K characters of Mandarin text. The English text was produced by 
translators with no access to the original audio. The translators were 
given specific guidelines for translation, and those are included in 
this distribution. A portion of the source data (6%) was translated four 
times in order to support experiments in translation evaluation.

------------------------------------------------------------------------


Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------

*
Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                  http://www.ldc.upenn.edu*
