[Corpora-List] New LDC Corpora

Wed Aug 24 21:31:43 UTC 2005

LDC2005T14
Chinese Gigaword Release Second Edition 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14>

LDC2005S16
MDE RT-04 Training Data Speech 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S16>

LDC2005T24
MDE RT-04 Training Data Text/Annotations 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T24>

The Linguistic Data Consortium (LDC) would like to announce the 
availability of three new corpora.

------------------------------------------------------------------------

(1) Chinese Gigaword Release Second Edition 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14> 
is a comprehensive archive of newswire text data in Chinese that has 
been acquired over several years by the LDC.
This release includes all of the content of the first release of the 
Chinese Gigaword corpus (LDC2003T09), material from one new source, and 
new material from the two existing sources.  The corpus thus contains 
three distinct international sources of Chinese newswire: Central News 
Agency (Taiwan), Xinhua News Agency, and Zaobao.

Some minor updates to the documents from the first release have been 
made; namely, the text portions of "story" type documents have been 
line-wrapped such that each line does not exceed 40 characters. 
Documents of the other types have not been modified. 
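
As an illustration of the 40-character line limit described above, here 
is a minimal Python sketch. It is not the LDC's actual procedure: the 
function name wrap_story_text is hypothetical, and it assumes that 
"characters" means Unicode code points and that lines may be broken at 
any position, since Chinese text is not space-delimited.

```python
def wrap_story_text(text, width=40):
    """Split text into lines of at most `width` characters each.

    Assumption: a "character" is one Unicode code point, and a line may
    be broken anywhere. Existing line breaks in the input are kept.
    """
    if width < 1:
        raise ValueError("width must be positive")
    lines = []
    for paragraph in text.splitlines():
        if not paragraph:
            # Preserve empty lines between paragraphs.
            lines.append("")
            continue
        # Cut the paragraph into consecutive chunks of `width` characters.
        for start in range(0, len(paragraph), width):
            lines.append(paragraph[start:start + width])
    return lines


if __name__ == "__main__":
    sample = "\u4e2d" * 85  # 85 copies of one Chinese character
    for line in wrap_story_text(sample):
        print(len(line), line)
```

Joining the returned lines without separators gives back the original 
paragraph text, so the wrapping changes layout only, not content.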

(2) MDE RT-04 Training Data Speech 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S16> 
was created to provide training data for the RT-04 Fall Metadata 
Extraction (MDE) Evaluation, part of the DARPA EARS (Efficient, 
Affordable, Reusable Speech-to-Text) Program. The goal of MDE is to 
enable technology that can take raw Speech-to-Text output and refine it 
into forms that are of more use to humans and to downstream automatic 
processes. In simple terms, this means the creation of automatic 
transcripts that are maximally readable. This readability might be 
achieved in a number of ways: flagging non-content words like filled 
pauses and discourse markers for optional removal; marking sections of 
disfluent speech; and creating boundaries between natural breakpoints in 
the flow of speech so that each sentence or other meaningful unit of 
speech might be presented on a separate line within the resulting 
transcript. Natural capitalization, punctuation and standardized 
spelling, plus sensible conventions for representing speaker turns and 
identity, are further elements in the readable transcript. LDC has 
defined a SimpleMDE annotation task specification and has annotated 
English telephone and broadcast news data to provide training data for 
MDE. 
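
To make the idea of flagging filled pauses for optional removal 
concrete, here is a toy Python sketch. It does not reproduce the 
SimpleMDE specification or any EARS system: the filler inventory 
FILLED_PAUSES and both function names are hypothetical, and real MDE 
systems work from speech and richer annotation than a word list.

```python
# Hypothetical filler inventory, for illustration only. The SimpleMDE
# specification defines its own categories and is not reproduced here.
FILLED_PAUSES = {"uh", "um", "er", "ah"}


def flag_filled_pauses(tokens):
    """Pair each token with True if it matches the filler inventory."""
    return [(tok, tok.lower().strip(",.") in FILLED_PAUSES) for tok in tokens]


def readable_line(tokens):
    """Drop the flagged tokens and join the rest into one transcript line."""
    flagged = flag_filled_pauses(tokens)
    return " ".join(tok for tok, is_filler in flagged if not is_filler)


if __name__ == "__main__":
    raw = "um i think uh it works".split()
    print(flag_filled_pauses(raw))
    print(readable_line(raw))
```

Keeping the flags separate from the removal step mirrors the "optional 
removal" wording above: a consumer can display the fillers, hide them, 
or style them differently.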

(3) MDE RT-04 Training Data Text/Annotations 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T24> 
was created to provide training data for the RT-04 Fall Metadata 
Extraction (MDE) Evaluation, part of the DARPA EARS (Efficient, 
Affordable, Reusable Speech-to-Text) Program.  In this release, some 
original annotations have been re-mapped to new MDE elements to support 
better annotation consistency. In particular, the mapping affects 
Discourse Responses (DR), Discourse Markers (DM) and Backchannel SUs (BC). 

------------------------------------------------------------------------

If you need further information, or would like to inquire about 
membership in the LDC, please email ldc at ldc.upenn.edu or call 
+1 215 573 1275.

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
3600 Market Street                             Fax:   (215) 573-2175
Suite 810                                      ldc at ldc.upenn.edu
Philadelphia, PA 19104                         http://www.ldc.upenn.edu
