[Corpora-List] New LDC Corpora

Wed Aug 3 20:57:18 UTC 2005

LDC2005T12
*English Gigaword Second Edition* 
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005T12>

LDC2005S15
*HKUST Mandarin Telephone Speech, Part 1* 
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005S15>

LDC2005T32
*HKUST Mandarin Telephone Transcript Data, Part 1* 
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005T32>

The Linguistic Data Consortium (LDC) would like to announce the 
availability of three new corpora.

------------------------------------------------------------------------

English Gigaword Second Edition 
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005T12> 
is a comprehensive archive of newswire text data in English that has 
been acquired over several years by the LDC. This release includes all 
of the contents in the first release of the English Gigaword corpus 
(LDC2003T05) as well as new data from July 2002 through Dec 2004. Some 
minor updates to these documents have been made; namely, the text 
portions of "story" type documents have been line-wrapped such that each 
line does not exceed 80 characters. Documents of the other types have 
not been modified.  The corpus contains five distinct international 
sources of English newswire:

Agence France Presse English Service (afe)
Associated Press Worldstream English Service (apw)
Central News Agency of Taiwan English Service (cne)
The New York Times Newswire Service (nyt)
The Xinhua News Agency English Service (xie)
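The 80-character line wrapping described above can be illustrated with a short Python sketch. This is not the LDC's actual processing script; the function names are hypothetical, and the sketch assumes a "story" paragraph is available as a plain string.

```python
import textwrap

MAX_WIDTH = 80  # line-length limit stated in the announcement


def wrap_story_text(paragraph: str, width: int = MAX_WIDTH) -> str:
    """Re-flow one paragraph so that no line exceeds `width` characters.

    Hypothetical helper: collapses existing whitespace, then re-wraps.
    """
    return textwrap.fill(" ".join(paragraph.split()), width=width)


def lines_exceeding(text: str, width: int = MAX_WIDTH) -> list[int]:
    """Return the 1-based numbers of lines longer than `width` characters."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > width
    ]
```

Running lines_exceeding over the text portion of a document is a quick way to check whether it already satisfies the 80-character limit.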

*

The Hong Kong University of Science and Technology (HKUST) collected and 
transcribed 200 hours of Mandarin Chinese conversational telephone 
speech from Mandarin speakers in mainland China.  HKUST Mandarin 
Telephone Speech, Part 1 
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005S15> 
contains the training and development sets with 873 and 24 calls, 
respectively.

All calls were operator-assisted: an operator called two scheduled 
participants to initiate each call. Subjects were asked demographic 
questions before they were bridged for normal conversation, and their 
answers were recorded in separate files. Subjects were allowed to talk 
for up to 10 minutes, and with a few exceptions calls run the maximum 
length. Each side of a call was recorded in a separate wav file, sampled 
at 8 kHz with 8 bits per sample (A-law encoded).
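As a rough guide to storage requirements, the audio format above implies one byte per sample at 8000 samples per second. The Python sketch below works through that arithmetic; it counts raw sample data only and ignores file headers, whose size is not given in this announcement. The function and constant names are illustrative.

```python
SAMPLE_RATE_HZ = 8000      # 8 kHz, as stated above
BYTES_PER_SAMPLE = 1       # 8-bit A-law: one byte per sample
MAX_CALL_SECONDS = 10 * 60 # calls were limited to 10 minutes


def audio_payload_bytes(
    seconds: float,
    sample_rate: int = SAMPLE_RATE_HZ,
    bytes_per_sample: int = BYTES_PER_SAMPLE,
) -> int:
    """Raw sample bytes for one side of a call (file header excluded)."""
    return int(seconds * sample_rate * bytes_per_sample)


per_side = audio_payload_bytes(MAX_CALL_SECONDS)  # one speaker's file
per_call = 2 * per_side                           # both sides of the call
print(per_side, per_call)
```

A full-length call therefore comes to about 4.8 MB of sample data per side, or about 9.6 MB for both sides.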

*

HKUST Mandarin Telephone Transcript Data, Part 1 
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005T32> 
is the corresponding transcription for HKUST Mandarin Telephone Speech, 
Part 1. Standard simplified Chinese characters, encoded in GBK 
(CP-936), were used. The transcribed speech was segmented at natural 
boundaries wherever possible and each segment is no more than 10 seconds 
long. The Chinese text is not segmented into words, though there are 
occasional white spaces within some turns.  HKUST Mandarin Telephone 
Transcript Data, Part 1 is distributed via web-download.
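Because the transcripts are encoded in GBK (CP-936), tools that expect UTF-8 will need a conversion step. The following Python sketch shows one way to do this using the standard "gbk" codec. The helper names are hypothetical, and the layout of the transcript files is not described in this announcement, so the functions operate on raw bytes and plain strings only.

```python
def gbk_to_utf8(raw: bytes) -> bytes:
    """Decode GBK (CP-936) bytes and re-encode them as UTF-8."""
    return raw.decode("gbk").encode("utf-8")


def remove_whitespace(text: str) -> str:
    """Drop all white space from a transcribed turn.

    The text is not word-segmented, so the occasional spaces inside a
    turn do not mark word boundaries. Removing them is an optional
    normalization choice, not something the corpus requires.
    """
    return "".join(text.split())


# Round trip on an in-memory sample rather than a real corpus file.
sample_gbk = "你好 世界".encode("gbk")
sample_text = gbk_to_utf8(sample_gbk).decode("utf-8")
print(remove_whitespace(sample_text))
```

To convert a whole file, read it in binary mode, pass the bytes to gbk_to_utf8, and write the result to a new file.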

------------------------------------------------------------------------

If you need further information, or would like to inquire about 
membership in the LDC, please email ldc at ldc.upenn.edu or call +1 215 573 
1275.

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
3600 Market Street                             Fax:   (215) 573-2175
Suite 810                                      ldc at ldc.upenn.edu
Philadelphia, PA 19104                         http://www.ldc.upenn.edu
