[Corpora-List] New LDC Corpora
Linguistic Data Consortium
ldc at ldc.upenn.edu
Wed Aug 3 20:57:18 UTC 2005
LDC2005T12
*English Gigaword Second Edition*
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005T12>
LDC2005S15
*HKUST Mandarin Telephone Speech, Part 1*
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005S15>
LDC2005T32
*HKUST Mandarin Telephone Transcript Data, Part 1*
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005T32>
The Linguistic Data Consortium (LDC) would like to announce the
availability of three new corpora.
------------------------------------------------------------------------
English Gigaword Second Edition
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005T12>
is a comprehensive archive of English newswire text that the LDC has
acquired over several years. This release includes all of the contents
of the first release of the English Gigaword corpus (LDC2003T05) as well
as new data from July 2002 through December 2004. Some minor updates
have been made to these documents: the text portions of "story" type
documents have been line-wrapped so that no line exceeds 80 characters.
Documents of the other types have not been modified. The corpus contains
five distinct international sources of English newswire:
Agence France Presse English Service (afe)
Associated Press Worldstream English Service (apw)
Central News Agency of Taiwan English Service (cne)
The New York Times Newswire Service (nyt)
The Xinhua News Agency English Service (xie)
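For readers who want to check or reproduce the 80-character line-wrapping mentioned for "story" type documents, the following is a minimal Python sketch. It is not the procedure the LDC used; the function names `wrap_story_text` and `max_line_length` are hypothetical, and the standard-library `textwrap` module is only assumed to give a comparable result.

```python
import textwrap


def wrap_story_text(paragraph: str, width: int = 80) -> str:
    """Re-wrap one paragraph so that no line is longer than `width`.

    Assumption: whitespace inside the paragraph carries no meaning,
    so it is safe to collapse and re-flow it.
    """
    return textwrap.fill(paragraph, width=width)


def max_line_length(text: str) -> int:
    """Return the length of the longest line in `text` (0 if empty)."""
    return max((len(line) for line in text.splitlines()), default=0)


if __name__ == "__main__":
    sample = "newswire " * 40
    wrapped = wrap_story_text(sample)
    print(wrapped)
    print("longest line:", max_line_length(wrapped))
```

`max_line_length` can also be run on text taken from the corpus to confirm that "story" documents stay within the stated limit.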
*
The Hong Kong University of Science and Technology (HKUST) collected and
transcribed 200 hours of Mandarin Chinese conversational telephone
speech from Mandarin speakers in mainland China. HKUST Mandarin
Telephone Speech, Part 1
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005S15>
contains the training and development sets with 873 and 24 calls,
respectively.
All calls were operator-assisted: an operator called the two
participants as scheduled to initiate each call. Subjects were asked
demographic questions before they were bridged for normal conversation,
and their answers were recorded in separate files. Subjects were allowed
to talk for up to 10 minutes, and with a few exceptions the calls run to
the maximum length. Each side of a call was recorded in a separate wav
file, sampled at 8 kHz with 8 bits per sample (a-law encoded).
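Since the audio is 8-bit a-law at 8 kHz, tools that expect linear PCM need the samples expanded first. Below is a minimal Python sketch of the standard G.711 a-law expansion to 16-bit linear values. It assumes the raw a-law sample bytes have already been extracted from the file (header parsing is not shown), and the function names are hypothetical.

```python
from typing import List

SAMPLE_RATE_HZ = 8000  # sampling rate stated in the announcement


def alaw_to_linear(a_val: int) -> int:
    """Expand one 8-bit a-law code (0..255) to a signed 16-bit value."""
    a_val ^= 0x55                  # undo the even-bit inversion
    t = (a_val & 0x0F) << 4        # 4-bit mantissa
    seg = (a_val & 0x70) >> 4      # 3-bit segment (exponent)
    if seg == 0:
        t += 8
    elif seg == 1:
        t += 0x108
    else:
        t = (t + 0x108) << (seg - 1)
    # In a-law the sign bit is set for positive samples.
    return t if a_val & 0x80 else -t


def decode_alaw(data: bytes) -> List[int]:
    """Decode a buffer of a-law bytes into linear sample values."""
    return [alaw_to_linear(b) for b in data]


def duration_seconds(num_samples: int, rate: int = SAMPLE_RATE_HZ) -> float:
    """Duration of one channel given its sample count."""
    return num_samples / rate


if __name__ == "__main__":
    print(decode_alaw(bytes([0xD5, 0x55, 0xAA, 0x2A])))
    print(duration_seconds(4_800_000))
```

At one byte per sample, a full 10-minute side comes to 8000 x 600 = 4,800,000 samples.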
*
HKUST Mandarin Telephone Transcript Data, Part 1
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005T32>
is the corresponding transcription for HKUST Mandarin Telephone Speech,
Part 1. Standard simplified Chinese characters, encoded in GBK
(CP-936), were used. The transcribed speech was segmented at natural
boundaries wherever possible and each segment is no more than 10 seconds
long. The Chinese text is not segmented into words, though there are
occasional white spaces within some turns. HKUST Mandarin Telephone
Transcript Data, Part 1 is distributed via web-download.
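Because the transcripts are GBK-encoded, they need to be decoded explicitly on systems that default to another encoding. The following Python sketch shows one way to do this with the built-in `gbk` codec. The function names are hypothetical, and the layout of the transcript files (time stamps, speaker labels) is not described here, so the whitespace helper should only be applied to the text of a turn.

```python
def decode_transcript_bytes(raw: bytes) -> str:
    """Decode GBK (CP-936) bytes into a Python string."""
    return raw.decode("gbk")


def to_utf8(raw: bytes) -> bytes:
    """Re-encode GBK bytes as UTF-8."""
    return raw.decode("gbk").encode("utf-8")


def drop_whitespace(turn_text: str) -> str:
    """Remove the occasional white space inside a turn.

    Assumption: the argument is the Chinese text of a single turn only.
    Any non-Chinese words separated by spaces would be joined as well.
    """
    return "".join(turn_text.split())


if __name__ == "__main__":
    raw = b"\xc4\xe3\xba\xc3"  # two characters encoded in GBK
    print(decode_transcript_bytes(raw))
    print(to_utf8(raw))
```

Since the text is not segmented into words, any word-level processing would need a separate segmentation step, which is outside the scope of this sketch.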
------------------------------------------------------------------------
If you need further information, or would like to inquire about
membership in the LDC, please email ldc at ldc.upenn.edu or call +1 215 573
2175.
--------------------------------------------------------------------
Linguistic Data Consortium     Phone: (215) 573-1275
3600 Market Street             Fax: (215) 573-2175
Suite 810                      ldc at ldc.upenn.edu
Philadelphia, PA 19104         http://www.ldc.upenn.edu