<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

</head>

<body bgcolor="#ffffff" text="#000000">

<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"

 align="center"><b>-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T36">Chinese

Treebank 6.0 (CTB 6.0)</a>  -</b><br>

<br>

<br>

<b>-  </b><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S11"><b>2004

Spring NIST Rich Transcription (RT-04S) Development Data</b></a><b>  -<br>

<br>

</b></p>

<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"

 align="center"><b>The Linguistic Data Consortium (LDC) would like to

announce the availability of two new publications.<br>

</b></p>

<hr size="2" width="100%">


<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"

 align="center"><b>New Publications</b></p>

<p>(1) The Chinese Treebank project began at the University of
Pennsylvania in 1998 and continues at Penn and the University of
Colorado. <a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T36">Chinese
Treebank 6.0</a> is the latest version produced by this effort. It
consists of approximately 780,000 words (over 1.28 million Chinese
characters) that are segmented, part-of-speech tagged and fully
bracketed. The data sources include newswire from Xinhua News Agency,
articles from Sinorama Magazine, news from the website of the Hong Kong
Special Administrative Region and transcripts from various broadcast
news programs.</p>

<p class="MsoNormal" style="margin-bottom: 12pt;">This release
comprises 2,036 text files containing 28,295 sentences, 781,351 words
and 1,285,149 hanzi (Chinese characters). The data is provided in two
encodings, GBK and UTF-8, and in four formats: raw text, word
segmented, word segmented and POS-tagged, and syntactically bracketed.
The syntactic annotation uses Penn Treebank-style labeled brackets.</p>
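<p>As a rough illustration of how the formats above relate to one
another, the following Python sketch reads one Penn Treebank-style
bracketed tree and recovers the POS-tagged and word-segmented views
from it, and converts GBK bytes to UTF-8. The example sentence, the
constant <code>EXAMPLE</code> and all function names are hypothetical
and are not taken from the corpus. The sketch assumes every tree has a
labeled root; any empty-category leaves would come out as ordinary
words.</p>

```python
# Hypothetical toy tree in Penn Treebank-style brackets (not corpus data).
EXAMPLE = "(IP (NP (PN 我)) (VP (VV 喜欢) (NP (NN 书))))"


def gbk_to_utf8(data):
    """Re-encode GBK bytes as UTF-8 bytes."""
    return data.decode("gbk").encode("utf-8")


def tokenize(text):
    """Split a bracketed string into '(', ')' and symbol tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Build nested lists of the form [label, child, child, ...]."""
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])          # start a new constituent
        elif tok == ")":
            node = stack.pop()        # finish the current constituent
            stack[-1].append(node)
        else:
            stack[-1].append(tok)     # a label or a word
    return stack[0][0]


def tagged_words(tree):
    """Return (word, POS) pairs for the leaves, left to right."""
    if len(tree) == 2 and isinstance(tree[1], str):
        return [(tree[1], tree[0])]   # preterminal: [POS, word]
    pairs = []
    for child in tree[1:]:
        pairs.extend(tagged_words(child))
    return pairs


def segmented(tree):
    """Return the words of the tree separated by single spaces."""
    return " ".join(word for word, _ in tagged_words(tree))


if __name__ == "__main__":
    tree = parse(tokenize(EXAMPLE))
    print(tagged_words(tree))
    print(segmented(tree))
```

<p>Removing the spaces from the segmented output would give the raw
text of the toy sentence, so all four views can be derived from the
bracketed one in this simplified setting.</p>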

<p class="MsoNormal" style="text-align: center;" align="center"><b>*</b></p>

<p>(2)  The <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S11">2004

Spring NIST Rich Transcription (RT-04S) Development Data</a> contains

the test

material (meeting speech and reference transcripts) used in the RT-04S

evaluation administered by the <a href="http://www.nist.gov/speech">NIST

(National Institute of Standards and Technology) Speech Group</a>. Rich

Transcription (RT) is broadly defined as a fusion of speech-to-text

technology

and metadata extraction technologies designed to provide the basis for

a

generation of more usable transcriptions of human-human meeting speech.</p>

<p>The RT-04S development data consists of approximately 10-minute
excerpts from recordings of eight meetings held at ICSI, CMU, LDC and
NIST. Although the development data is drawn from the same data
collection sites that are represented in <a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S12">LDC2007S12
2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data</a>, it
does not fully reflect the evaluation test data: it uses lapel
microphones in lieu of head microphones for the LDC and CMU data, and
some different distant microphones for the LDC data. <br>

<br>

RT-04S included the following tasks in the meeting domain:<br>

</p>

<p>Speech-to-Text Transcription (STT) tasks<br>

</p>

<blockquote>Microphone conditions:<br>
· Multiple distant microphones<br>
· Single distant microphone<br>
· Individual head microphone<br>
  <br>
Processing time conditions:<br>
· Unlimited time STT<br>
· Less than or equal to twenty times realtime<br>
· Less than or equal to ten times realtime<br>
· Less than or equal to one times realtime<br>
</blockquote>

<p><br>

Diarization (SPKR) task (who spoke when)<br>

</p>

<blockquote>Microphone conditions:<br>
· Multiple distant microphones<br>
· Single distant microphone<br>
  <br>
Input conditions:<br>
· Speech input only<br>
· Speech plus reference transcript input<br>
  <br>
Processing time conditions:<br>
· Unlimited time<br>
· Less than or equal to twenty times realtime<br>
· Less than or equal to ten times realtime<br>
· Less than or equal to one times realtime<br>
</blockquote>
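<p>The processing time conditions listed above can be illustrated with
a short Python sketch. It assumes the conventional reading of "times
realtime", namely total processing time divided by the duration of the
audio processed; the announcement itself does not define the term, and
the function names and labels below are hypothetical.</p>

```python
# Upper bounds on the realtime factor, tightest first (labels are illustrative).
LIMITS = [
    (1.0, "one times realtime"),
    (10.0, "ten times realtime"),
    (20.0, "twenty times realtime"),
]


def realtime_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration (assumed definition)."""
    if not audio_seconds > 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def tightest_condition(factor):
    """Name the tightest processing time condition a system satisfies."""
    for limit, name in LIMITS:
        if limit >= factor:
            return name
    return "unlimited time"


if __name__ == "__main__":
    # A 10-minute excerpt (600 s) processed in 50 minutes (3000 s).
    factor = realtime_factor(3000, 600)
    print(factor, tightest_condition(factor))
```

<p>Under this assumption, a system that needs 50 minutes to process a
10-minute excerpt runs at five times realtime and would fall under the
ten times realtime condition.</p>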


<hr size="2" width="100%">

<div align="center"><small><font face="Courier New, Courier, monospace"><br>

<br>

Ilya

Ahtaridis</font></small><br>

<small><font face="Courier New, Courier, monospace">Membership

Coordinator</font></small><br>

--------------------------------------------------------------------

<br>

</div>

<div align="center">

<pre class="moz-signature" cols="72"><b><small><font

 face="Courier New, Courier, monospace">

</font></small>Linguistic Data Consortium                     Phone: (215) 573-1275

University of Pennsylvania                       Fax: (215) 573-2175

3600 Market St., Suite 810                         <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104 USA                  <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></b></pre>

</div>


</body>

</html>