<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

<div class="moz-text-html" lang="x-western">

<div class="moz-text-html" lang="x-western"> <br>

<div align="center">LDC2005T12<br>

<a

 href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005T12"><font

 face="Times New Roman"><b>English Gigaword Second Edition</b></font></a><br>

<br>

LDC2005S15<br>

<a

 href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005S15"><font

 face="Times New Roman"><b>HKUST Mandarin Telephone Speech, Part 1</b></font></a><br>

<br>

LDC2005T32<br>

<font face="Times New Roman"><a

 href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005T32"><b>HKUST

Mandarin Telephone Transcript Data, Part 1</b></a></font><br>

<br>

<br>

<big><font face="Times New Roman"><small>The Linguistic Data Consortium

(LDC) would like to announce the availability of three new corpora.</small></font></big>

</div>

<big><font face="Times New Roman"><small><br>

</small></font></big>

<hr size="2" width="100%"><big><font face="Times New Roman"><small><br>

</small></font></big>

<font face="Times New Roman"><a

 href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005T12">English

Gigaword Second Edition</a> is a comprehensive archive of

newswire text data in English that has been acquired over several years

by the LDC. This release includes all of the contents in the first

release of the English Gigaword corpus (LDC2003T05) as well as new data

from July 2002 through December 2004. Some minor updates to these documents

have been made: the text portions of "story" type documents

have been line-wrapped so that no line exceeds 80

characters. Documents of the other types have not been

modified.  The corpus contains five distinct international sources of

English

newswire:

<br>

<br>

Agence France-Presse English Service (afe)

<br>

Associated Press Worldstream English Service (apw)

<br>

Central News Agency of Taiwan English Service (cne)

<br>

The New York Times Newswire Service (nyt)

<br>

The Xinhua News Agency English Service (xie)

<br>

<br>

</font>

<div align="center"><font face="Times New Roman">*</font><br>

</div>

<font face="Times New Roman"><br>

The Hong Kong University of Science and Technology (HKUST) collected

and transcribed 200 hours of Mandarin Chinese conversational telephone

speech from Mandarin speakers in mainland China.  <a

 href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005S15">HKUST

Mandarin

Telephone Speech, Part 1</a> contains the training and development sets

with 873 and 24 calls, respectively.

<br>

<br>

All calls were operator-assisted: an operator called the two

participants at the scheduled time to initiate each call. Subjects were asked

demographic questions before they were bridged for normal conversation.

Their answers to the demographic questions were recorded in separate

files.  Subjects were allowed to talk for up to 10 minutes, and with a few

exceptions the calls run the maximum length.

Each side of a call was recorded in a separate wav file, sampled at

8 kHz with 8-bit A-law encoding. <br>

</font><font face="Times New Roman"><small><br>

</small><br>

</font>

<div align="center"><font face="Times New Roman">*</font><br>

</div>

<p><font face="Times New Roman"><a

 href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005T32">HKUST

Mandarin Telephone Transcript Data, Part 1</a> contains the

transcripts corresponding to HKUST Mandarin Telephone Speech,

Part 1. Standard simplified Chinese characters, encoded in GBK

(CP-936), were used. The transcribed speech was segmented at natural

boundaries wherever possible, and each segment is no more than 10

seconds long. The Chinese text is not segmented into words, though

whitespace occasionally appears within some turns.  HKUST Mandarin

Telephone Transcript Data, Part 1 is distributed via web download.<br>

</font></p>

<p><big><font face="Times New Roman"><small><br>

<br>

</small></font></big></p>

<div align="center">

<hr size="2" width="100%"><br>

</div>

<big><font face="Times New Roman"><small><br>

</small></font></big>

<div align="center"><big><font face="Times New Roman"><small>If you

need further

information, or would like to inquire about

membership in the LDC, please email <a class="moz-txt-link-abbreviated"

 href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a> or call +1 215

573 1275.<br>

<br>

</small></font></big></div>

<div align="center">--------------------------------------------------------------------<br>

</div>

<div align="center">

<pre class="moz-signature" cols="72">Linguistic Data Consortium                     Phone: (215) 573-1275

3600 Market Street                             Fax:   (215) 573-2175

Suite 810                                          <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104                      <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></pre>

</div>

</div>

</div>

</body>

</html>