Corpora: Two New Releases from the LDC

LDC Office ldc at ldc.upenn.edu
Tue Jul 17 20:11:09 UTC 2001


The Linguistic Data Consortium (LDC) is pleased to announce the
availability of two new releases.

1.  Message Understanding Conference (MUC) 7
LDC2001T02, ISBN 1-58563-205-8, ftp file
http://www.ldc.upenn.edu/Catalog/LDC2001T02.html

2.  CALLHOME Spanish Dialogue Act Annotation
LDC2001T61, ISBN 1-58563-197-3, ftp file
http://www.ldc.upenn.edu/Catalog/LDC2001T61.html


--


1. The Message Understanding Conference (MUC) 7 corpus contains texts
and annotations of newswire files drawn from the 1996 NY Times News
Wire.  These newswire files were used in the Message Understanding
Conference (MUC) 7 proceedings for the development of information
extraction systems.

Some excerpts from the NIST Information Extraction web page are
presented below:
http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html

Information extraction systems have been evaluated under the support of
DARPA and other government agencies for almost a decade. Since early
1990, the MUC evaluations have been funding the development of metrics
and statistical algorithms to support government evaluations of emerging
information extraction technologies.

In the mid-nineties MUC evaluations began to provide prepared data and
task definitions in addition to providing fully automated scoring
software to measure machine and human performance. The tasks grew from
just production of a database of events found in newswire articles from
one source to the production of multiple databases of increasingly
complex information extracted from multiple sources of news in multiple
languages. The databases now include named entities, multilingual named
entities, attributes of those entities, facts about relationships
between entities, and events in which the entities participated.

The results of these evaluations were reported at conferences during the
1990's where developers and evaluators shared their findings and
government specialists described their needs. These conferences were
called 'Message Understanding Conferences (MUC)' as a result of the use
of such technology to process military messages.


Institutions that have membership in the LDC during the 2001 Membership
Year will be able to receive this corpus free of charge. The non-member
cost is $100.  Please note that there is also an associated user
agreement for both members and non-members.



2.  CALLHOME Spanish Dialogue Act Annotation was developed under Project
CLARITY.  The goal of CLARITY was to glean discourse information from
unrestricted conversational speech using shallow corpus-based analysis.
The annotation was carried out at Interactive Systems Labs at Carnegie
Mellon University.

In this ftp publication, a three-level coding scheme was used to manually
tag the LDC publication CALLHOME Spanish Transcripts:
http://www.ldc.upenn.edu/Catalog/LDC96T17.html
The three levels of the coding scheme are:

1.  a dialogue act level consisting of a tag set extended from DAMSL
and Switchboard

2.  a dialogue game level featuring short sequences of dialogue acts

3.  a genre level similar to topical segments.

All 120 dialogues have been annotated.  This publication contains
approximately 11,835 unique words and 211,940 total words.


Dialogue games are short sequences of dialogue acts such as
question/answer pairs.  Genres include storytelling, discussion, and
planning.  Segmentation takes topics into account as well.  Genres,
games and dialogue acts are annotated by type.  Genres are additionally
annotated for activities and topics (on a 0-5 scale) for the central
object or person being discussed (the 'who' or 'what' category); each
genre annotation also contains a short synopsis of the segment.
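
As a rough illustration of how the three levels nest, the sketch below
models dialogue acts grouped into games, and games grouped into genre
segments.  It is not the corpus file format: the class names, field
names, tag labels, the reading of the 0-5 scale as a single topic score,
and the sample utterances are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DialogueAct:
    """Level 1: a single tagged utterance (labels here are hypothetical)."""
    speaker: str
    text: str
    act_type: str


@dataclass
class DialogueGame:
    """Level 2: a short sequence of dialogue acts, e.g. a question/answer pair."""
    game_type: str
    acts: List[DialogueAct] = field(default_factory=list)


@dataclass
class GenreSegment:
    """Level 3: a genre segment with a type, a synopsis and a 0-5 score."""
    genre_type: str
    synopsis: str
    topic_score: int  # assumed interpretation of the 0-5 scale
    games: List[DialogueGame] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject scores outside the 0-5 range mentioned in the announcement.
        if not 0 <= self.topic_score <= 5:
            raise ValueError("topic_score must be between 0 and 5")


def count_acts(segment: GenreSegment) -> int:
    """Total number of dialogue acts across all games in a segment."""
    return sum(len(game.acts) for game in segment.games)


# Invented sample data: one question/answer game inside a planning segment.
segment = GenreSegment(
    genre_type="planning",
    synopsis="Speakers arrange a visit.",
    topic_score=3,
    games=[
        DialogueGame(
            game_type="question_answer",
            acts=[
                DialogueAct("A", "¿Cuándo vienes?", "question"),
                DialogueAct("B", "El sábado.", "answer"),
            ],
        )
    ],
)

print(count_acts(segment))
```

The nesting mirrors the description above: each level is annotated by
type, and only the genre level carries the extra score and synopsis.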

Papers on annotation schemes from the 1999 ACL Workshop for Discourse
Tagging and LREC-2000 and technical papers on automatic detection are
available at the Interactive Systems Labs site:
http://www.is.cs.cmu.edu

Institutions that have membership in the LDC during the 2001 Membership
Year will be able to receive this corpus free of charge. The non-member
cost is $600.


--


If you would like to order a copy of either corpus, please email your
request to <ldc at ldc.upenn.edu>.  User agreements may be faxed to
215.573.2175.

If you need additional information before placing your order, or would
like to inquire about membership in the LDC, please send email or call
215.573.1275.


