<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

  <title></title>

</head>

<body bgcolor="#ffffff" text="#000000">

<div class="moz-text-html" lang="x-western">

<div align="center"> LDC2005T14<br>

<a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14">Chinese

Gigaword Release Second Edition

</a><br>

<br>

LDC2005S16

<br>

<a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S16">MDE

RT-04 Training Data Speech</a>

<br>

<br>

LDC2005T24<br>

<a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T24">MDE

RT-04 Training Data Text/Annotations</a>

<br>

<br>

<br>

<font face="Times New Roman">The

Linguistic Data Consortium

(LDC) would like to announce the availability of three new corpora.</font></div>

<br>

<hr size="2" width="100%"><br>

(1) <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14">Chinese

Gigaword Release Second Edition</a> is a comprehensive archive of

newswire text data in Chinese that has been acquired over several years

by the LDC.

<br>

This release includes all of the content of the first release of the

Chinese Gigaword corpus (LDC2003T09), material from one new source, and

new material from the two existing sources.  The corpus therefore

contains three distinct international sources of Chinese newswire:

Central News Agency (Taiwan), Xinhua News Agency, and Zaobao.

<br>

<br>

Minor updates have been made to the documents from the first release:

the text portions of "story" type documents have been

line-wrapped so that no line exceeds 40 characters.

Documents of other types have not been modified.  <br>

<br>

(2) <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S16">MDE

RT-04 Training Data Speech</a> was created  to provide training data

for the RT-04 Fall Metadata Extraction (MDE) Evaluation, part of the

DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text) Program.

The goal of MDE is to enable technology that can take raw

Speech-to-Text output and refine it into forms that are of more use to

humans and to downstream automatic processes. In simple terms, this

means the creation of automatic transcripts that are maximally

readable. This readability might be achieved in a number of ways:

flagging non-content words like filled pauses and discourse markers for

optional removal; marking sections of disfluent speech; and creating

boundaries between natural breakpoints in the flow of speech so that

each sentence or other meaningful unit of speech might be presented on

a separate line within the resulting transcript. Natural

capitalization, punctuation, and standardized spelling, plus sensible

conventions for representing speaker turns and identity, are further

elements of a readable transcript. LDC has defined a SimpleMDE

annotation task specification and has annotated English telephone and

broadcast news data to provide training data for MDE.  <br>

<br>

(3) <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T24">MDE

RT-04 Training Data Text/Annotations</a> was created to provide

training data for the RT-04 Fall Metadata Extraction (MDE) Evaluation,

part of the DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text)

Program.  In this release, some original annotations have been

re-mapped to new MDE elements to support better annotation consistency.

In particular, the mapping affects Discourse Responses (DR), Discourse

Markers (DM) and Backchannel SUs (BC).  <br>

<br>

<br>

<hr size="2" width="100%"><br>

<br>

<div align="center"><big><font face="Times New Roman"><small>If you

need further

information, or would like to inquire about

membership in the LDC, please email <a class="moz-txt-link-abbreviated"

 href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a> or call +1 215

573 1275.<br>

<br>

</small></font></big></div>

<div align="center">--------------------------------------------------------------------<br>

</div>

<div align="center">

<pre class="moz-signature" cols="72">Linguistic Data Consortium                     Phone: (215) 573-1275

3600 Market Street                             Fax:   (215) 573-2175

Suite 810                                          <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104                      <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></pre>

</div>

</div>

</body>

</html>