<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
</head>
<body bgcolor="#ffffff" text="#000000">
<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"
align="center"><b>- <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T36">Chinese
Treebank 6.0 (CTB 6.0)</a> -</b><br>
<br>
<br>
<b>- </b><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S11"><b>2004
Spring NIST Rich Transcription (RT-04S) Development Data</b></a><b> -<br>
<br>
</b></p>
<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"
align="center"><b>The Linguistic Data Consortium (LDC) would like to
announce the availability of two new publications.<br>
</b></p>
<hr size="2" width="100%">
<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"
align="center"><b><br>
</b></p>
<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"
align="center"><b>New Publications</b><br>
<br>
</p>
<p>(1) The Chinese Treebank project began at the University of
Pennsylvania in 1998 and continues at Penn and the University of
Colorado. <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T36">Chinese
Treebank 6.0</a> is the latest version produced by this effort. It
consists of approximately 780,000 words (over 1.28 million Chinese
characters) that are segmented, part-of-speech tagged and fully
bracketed. The data sources include newswire from Xinhua News Agency,
articles from Sinorama Magazine, news from the website of the Hong Kong
Special Administrative Region and transcripts from various broadcast
news programs.</p>
<p class="MsoNormal" style="margin-bottom: 12pt;">This
release comprises 2,036 text files containing 28,295 sentences,
781,351 words and 1,285,149 hanzi (Chinese characters). The data is
provided in two encodings, GBK and UTF-8, and in four formats: raw
text, word segmented, word segmented and POS-tagged, and syntactically
bracketed. The annotation uses Penn Treebank-style labeled brackets.
<br>
<br>
</p>
<p class="MsoNormal" style="text-align: center;" align="center"><b>*</b></p>
<p class="MsoNormal" style=""> </p>
<p>(2) The <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S11">2004
Spring NIST Rich Transcription (RT-04S) Development Data</a> contains
the test
material (meeting speech and reference transcripts) used in the RT-04S
evaluation administered by the <a href="http://www.nist.gov/speech">NIST
(National Institute of Standards and Technology) Speech Group</a>. Rich
Transcription (RT) is broadly defined as a fusion of speech-to-text
technology
and metadata extraction technologies designed to provide the basis for
a
generation of more usable transcriptions of human-human meeting speech.</p>
<p>The RT-04S development data consists of approximately 10-minute
excerpts from recordings of eight meetings held at ICSI, CMU, LDC and
NIST. Although the development data is drawn from the same data
collection sites that are represented in <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S12">LDC2007S12
2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data</a>, it is
not completely reflective of the evaluation test data, since it uses
lapel mics in lieu of head mics for the LDC and CMU data and some
different distant mics for the LDC data. <br>
<br>
RT-04S included the following tasks in the meeting domain:<br>
</p>
<p>Speech-to-Text Transcription (STT) tasks<br>
</p>
<blockquote>Microphone conditions:<br>
· Multiple distant microphones<br>
· Single distant microphone<br>
· Individual head microphone<br>
<br>
Processing time conditions:<br>
· Unlimited time STT<br>
· Less than or equal to twenty times realtime<br>
· Less than or equal to ten times realtime<br>
· Less than or equal to one times realtime<br>
</blockquote>
<p><br>
Diarization (SPKR) task (who spoke when)<br>
</p>
<blockquote>Microphone conditions:<br>
· Multiple distant microphones<br>
· Single distant microphone<br>
<br>
Input conditions:<br>
· Speech input only<br>
· Speech plus reference transcript input<br>
<br>
Processing time conditions:<br>
· Unlimited time<br>
· Less than or equal to twenty times realtime<br>
· Less than or equal to ten times realtime<br>
· Less than or equal to one times realtime<br>
</blockquote>
<hr size="2" width="100%">
<div align="center"><small><font face="Courier New, Courier, monospace"><br>
<br>
Ilya
Ahtaridis</font></small><br>
<small><font face="Courier New, Courier, monospace">Membership
Coordinator</font></small><br>
--------------------------------------------------------------------
<br>
<font face="Courier New, Courier, monospace"></font></div>
<div align="center">
<pre class="moz-signature" cols="72"><b><small><font
face="Courier New, Courier, monospace">
</font></small>Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 <a
class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a
class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></b></pre>
</div>
</body>
</html>