[Corpora-List] New LDC Corpora
Linguistic Data Consortium
ldc at ldc.upenn.edu
Wed Aug 24 21:31:43 UTC 2005
LDC2005T14
Chinese Gigaword Release Second Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14>
LDC2005S16
MDE RT-04 Training Data Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S16>
LDC2005T24
MDE RT-04 Training Data Text/Annotations
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T24>
The Linguistic Data Consortium (LDC) would like to announce the
availability of three new corpora.
------------------------------------------------------------------------
(1) Chinese Gigaword Release Second Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14>
is a comprehensive archive of newswire text data in Chinese that has
been acquired over several years by the LDC.
This release includes all of the content of the first release of the
Chinese Gigaword corpus (LDC2003T09), material from one new source, and
new material from the two existing sources. The corpus thus contains
three distinct international sources of Chinese newswire: Central News
Agency (Taiwan), Xinhua News Agency, and Zaobao.
Some minor updates to the documents from the first release have been
made; namely, the text portions of "story" type documents have been
line-wrapped such that each line does not exceed 40 characters.
Documents of the other types have not been modified.
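
As a rough illustration of the 40-character limit described above, the
following sketch wraps a string into fixed-width lines and reports any
lines that exceed the limit. It assumes that "characters" means Unicode
code points rather than bytes, and the function names are hypothetical;
the announcement does not specify the exact wrapping rule used for the
corpus.

```python
def wrap_fixed_width(text: str, width: int = 40) -> list[str]:
    """Split text into consecutive chunks of at most `width` characters.

    Assumption: one "character" is one Unicode code point, which fits
    Chinese text that has no spaces to break on.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    return [text[i:i + width] for i in range(0, len(text), width)]


def lines_exceeding(lines: list[str], width: int = 40) -> list[int]:
    """Return the 1-based numbers of lines longer than `width` characters."""
    return [n for n, line in enumerate(lines, start=1) if len(line) > width]
```

A check such as lines_exceeding(document_lines) returning an empty list
would indicate that a "story" document respects the limit under this
assumption.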
(2) MDE RT-04 Training Data Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S16>
was created to provide training data for the RT-04 Fall Metadata
Extraction (MDE) Evaluation, part of the DARPA EARS (Efficient,
Affordable, Reusable Speech-to-Text) Program. The goal of MDE is to
enable technology that can take raw Speech-to-Text output and refine it
into forms that are of more use to humans and to downstream automatic
processes. In simple terms, this means the creation of automatic
transcripts that are maximally readable. This readability might be
achieved in a number of ways: flagging non-content words like filled
pauses and discourse markers for optional removal; marking sections of
disfluent speech; and creating boundaries at natural breakpoints in
the flow of speech so that each sentence or other meaningful unit of
speech might be presented on a separate line within the resulting
transcript. Natural capitalization, punctuation and standardized
spelling, plus sensible conventions for representing speaker turns and
identity, are further elements in the readable transcript. LDC has
defined a SimpleMDE annotation task specification and has annotated
English telephone and broadcast news data to provide training data for
MDE.
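
To make the idea of a "maximally readable" transcript concrete, here is
a toy sketch of how metadata labels might be used to render
Speech-to-Text output. The Token class, the label names and the
ends_unit flag are hypothetical and chosen only for illustration; they
are not the SimpleMDE annotation format, which is defined in the task
specification distributed by LDC.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    # Hypothetical labels: "word", "filled_pause", "discourse_marker".
    label: str = "word"
    # True if a sentence-like unit ends after this token (hypothetical flag).
    ends_unit: bool = False


REMOVABLE = {"filled_pause", "discourse_marker"}


def render(tokens: list[Token], drop_removable: bool = True) -> list[str]:
    """Return one string per sentence-like unit, optionally dropping
    tokens flagged as filled pauses or discourse markers."""
    lines: list[str] = []
    current: list[str] = []
    for tok in tokens:
        if not (drop_removable and tok.label in REMOVABLE):
            current.append(tok.text)
        if tok.ends_unit:
            if current:
                lines.append(" ".join(current))
            current = []
    if current:  # flush any trailing words with no closing boundary
        lines.append(" ".join(current))
    return lines
```

Because removal is optional, calling render with drop_removable=False
keeps every token and only applies the unit boundaries.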
(3) MDE RT-04 Training Data Text/Annotations
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T24>
was created to provide training data for the RT-04 Fall Metadata
Extraction (MDE) Evaluation, part of the DARPA EARS (Efficient,
Affordable, Reusable Speech-to-Text) Program. In this release, some
original annotations have been re-mapped to new MDE elements to support
better annotation consistency. In particular, the mapping affects
Discourse Responses (DR), Discourse Markers (DM) and Backchannel SUs (BC).
------------------------------------------------------------------------
If you need further information, or would like to inquire about
membership in the LDC, please email ldc at ldc.upenn.edu or call
+1 215 573 1275.
--------------------------------------------------------------------
Linguistic Data Consortium        Phone: (215) 573-1275
3600 Market Street                Fax: (215) 573-2175
Suite 810                         ldc at ldc.upenn.edu
Philadelphia, PA 19104            http://www.ldc.upenn.edu