[Corpora-List] New LDC Corpora

Tue Sep 16 15:31:49 UTC 2003

                           LDC2003T11
                    *   ACE-2 Version 1.0   *

                           LDC2003T13
            *   Message Understanding Conference (MUC) 6   *

The Linguistic Data Consortium (LDC) is pleased to announce the
availability of two new corpora.

                               *

ACE-2 Version 1.0 supports the Automatic Content Extraction (ACE)
program, whose objective is to develop extraction technology to support
automatic processing of source language data. This includes
classification, filtering, and selection based on the language content
of the source data, i.e., based on the meaning conveyed by the data.
Thus, the ACE program requires the development of technologies that
automatically detect and characterize this meaning. The ACE research
objectives are viewed as the detection and characterization of Entities,
Relations, and Events.

Annotations for the ACE-2 corpus concern two research tasks: Entity
Detection and Tracking (EDT) and Relation Detection and Characterization
(RDC).  ACE-2 contains two sets of data: training and devtest. Each of
these sets is further divided by source: broadcast news, newspaper, and
newswire. There are 179,007 words of source data in 519 files.

For further information about this corpus, including a link to online
documentation and the NIST ACE program site, please visit:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T11

Institutions that have membership in the LDC during the 2003
Membership Year will be able to receive this corpus free of charge.
Nonmembers may license this publication for US$500.

                               *

In the 1990s, the MUC evaluations funded the development of metrics and
statistical algorithms to support government evaluations of emerging
information extraction technologies.  The Message Understanding
Conference (MUC) 6 corpus contains 318 annotated Wall Street Journal
articles, scoring software, and corresponding documentation used in the
MUC 6 evaluation. Both the MUC 6 Additional News Text (LDC96T10) corpus
and the MUC 6 corpus are necessary in order to replicate the evaluation.

All the materials have been published as received from the corpus
authors.  No quality control has been conducted at the LDC; however, the
text files have been uncompressed.

For further information, including online documentation and a link to
NIST's MUC pages, please visit:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T13

Institutions that have membership in the LDC during the 2003
Membership Year will be able to receive this corpus free of charge.
Nonmembers may license this publication for US$100.

                               *

MUC VI Text Collection (LDC96T10) has been renamed MUC 6 Additional News
Text.  The new title more accurately reflects the corpus data, as it
consists only of additional training materials for the MUC 6 evaluation.

If you need additional information before placing your order, or
would like to inquire about membership in the LDC, please send email to
ldc at ldc.upenn.edu or call (215) 573-1275.

---------------------------------------------------------------------
Linguistic Data Consortium          Phone: (215) 573-1275
3600 Market Street                  Fax:   (215) 573-2175
Suite 810                           email: ldc at ldc.upenn.edu
Philadelphia, PA 19104-2653         www: http://www.ldc.upenn.edu
