<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

<p class="MsoNormal" style="text-align: center;" align="center">LDC2007S12<br>

<a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S12"><b>2004

Spring NIST Rich Transcription (RT-04S) Evaluation Data</b></a><br>

<br>

LDC2007T19<br>

<a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T19"><b>MITRE

1997 Mandarin Broadcast News Speech Translations (Hub-4NE)</b></a><br>

<br>

</p>

<p class="MsoNormal" style="text-align: center;" align="center"><b>The

Linguistic Data Consortium (LDC) is pleased to announce the

availability of two new publications.</b><br>

<b><br>

</b></p>

<hr size="2" width="100%">

<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"

 align="center"><b>New

Publications<br>

</b></p>

<p>(1)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S12">2004

Spring NIST Rich Transcription (RT-04S) Evaluation Data</a> contains

the test

material (meeting speech and reference transcripts) used in the RT-04S

evaluation administered by the <a href="http://www.nist.gov/speech">NIST

(National Institute of Standards and Technology) Speech Group</a>. Rich

Transcription (RT) is broadly defined as a fusion of speech-to-text

technology

and metadata extraction technologies designed to provide the basis for

a

generation of more usable transcriptions of human-human meeting speech.</p>

<p>The data in this release consists of portions of meeting speech

collected

and/or transcribed by the International Computer Science Institute

(ICSI) at Berkeley,

the Interactive Systems Laboratories (ISL) at Carnegie

Mellon University,

NIST and LDC. The complete meeting speech and corresponding transcript

data

sets are available from LDC's catalog as follows: <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S02">ICSI

Meeting Speech (LDC2004S02)</a>, <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T04">ICSI

Meeting Transcripts (LDC2004T04)</a>, <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S05">ISL

Meeting Speech Part 1 (LDC2004S05)</a>, <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T10">ISL

Meeting Transcripts Part 1 (LDC2004T10)</a>, <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S09">NIST

Meeting Pilot Corpus Speech (LDC2004S09)</a> and <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T13">NIST

Meeting Pilot Corpus Transcripts and Metadata (LDC2004T13)</a>.</p>

<p>RT-04S included the following tasks in the meeting domain:</p>

<dl>

  <dt><strong>Speech-to-Text Transcription (STT) tasks</strong> </dt>

  <dd><strong>Microphone conditions:</strong>

    <ul>

      <li>Multiple distant microphones </li>

      <li>Single distant microphone </li>

      <li>Individual head microphone </li>

    </ul>

  </dd>

  <dd><strong>Processing time conditions:</strong>

    <ul>

      <li>Unlimited time STT </li>

      <li>Less than or equal to twenty times realtime </li>

      <li>Less than or equal to ten times realtime </li>

      <li>Less than or equal to one times realtime </li>

    </ul>

  </dd>

  <dt><strong>Diarization (SPKR) task ("who spoke when")</strong> </dt>

  <dd><strong>Microphone conditions:</strong>

    <ul>

      <li>Multiple distant microphones </li>

      <li>Single distant microphone </li>

    </ul>

  </dd>

  <dd><strong>Input conditions:</strong>

    <ul>

      <li>Speech input only </li>

      <li>Speech plus reference transcript input </li>

    </ul>

  </dd>

  <dd><strong>Processing time conditions:</strong>

    <ul>

      <li>Unlimited time </li>

      <li>Less than or equal to twenty times realtime </li>

      <li>Less than or equal to ten times realtime </li>

      <li>Less than or equal to one times realtime </li>

    </ul>

  </dd>

</dl>


<p class="MsoNormal" style="margin-bottom: 12pt;">Further information

about the evaluation is

available on the

<a href="http://www.nist.gov/speech/tests/rt/rt2004/spring/">RT-04

Spring

Evaluation Website</a>.</p>

<p class="MsoNormal" style="text-align: center;" align="center">*</p>

<p class="MsoNormal"><b><br>

</b>(2)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T19">MITRE

1997 Mandarin Broadcast News Speech Translations (Hub-4NE)</a> was

developed by The MITRE Corporation and contains segment-aligned English

translations of the 1997 DARPA HUB4-NE Mandarin transcripts. The

original

transcripts and the corresponding broadcast news audio are available as

separate LDC publications, <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98T24"><span

 style="text-decoration: none;">1997 Mandarin Broadcast News

Transcripts (HUB4-NE) (LDC98T24)</span></a> and <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98S73"><span

 style="text-decoration: none;">1997 Mandarin Broadcast News

Speech (HUB4-NE) (LDC98S73)</span></a>.</p>

<p>The source data consists of 30 hours of recorded Mandarin

broadcasts

collected by the LDC in 1997 from Voice of America, China Central TV

and

KAZN-AM, a commercial radio station based in Los

Angeles, CA. The

original transcript segmentation is

suitable for speech recognition, but does not support machine

translation or

machine translation evaluation. Therefore, the Mandarin side of these

aligned

transcripts was resegmented for this release; in all other respects,

the

Mandarin transcripts in this publication are identical to the original

transcripts.</p>

<p>The dataset in this release consists of 376K words of English text

and 517K

characters of Mandarin text. The English text was produced by

translators with

no access to the original audio. The translators were given specific

guidelines

for translation, and those are included in this distribution. A portion

of the

source data (6%) was translated four times in order to support

experiments in

translation evaluation.</p>

<hr size="2" width="100%"><br>

<div align="center"><small><font face="Courier New, Courier, monospace"><br>

Ilya

Ahtaridis<br>

Membership Coordinator</font></small><br>

--------------------------------------------------------------------

<font face="Courier New, Courier, monospace"><br>

</font></div>

<div align="center">

<pre class="moz-signature" cols="72"><b><small><font

 face="Courier New, Courier, monospace">

</font></small>Linguistic Data Consortium                     Phone: (215) 573-1275

University of Pennsylvania                       Fax: (215) 573-2175

3600 Market St., Suite 810                         <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104 USA                  <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></b></pre>

</div>

</body>

</html>