Corpora: New Corpora

LDC Office ldc at unagi.cis.upenn.edu
Wed Nov 15 22:17:35 UTC 2000


The Linguistic Data Consortium is pleased to announce three new
corpora.

Topic Detection and Tracking (TDT) 2 Careful Transcription Audio
http://www.ldc.upenn.edu/Catalog/LDC2000S92.html
$300 for nonmembers

Topic Detection and Tracking (TDT) 2 Careful Transcription Text
http://www.ldc.upenn.edu/Catalog/LDC2000T44.html
$200 for nonmembers

This release contains broadcast news speech and transcripts from
the following sources:

ABC	January-June 1998
CNN	January-June 1998
PRI	January-June 1998
VOA	March-June 1998

The audio files are single-channel, 16 kHz, 16-bit linear SPHERE
files.  Topic Detection and Tracking (TDT) refers to automatic
techniques for finding topically related material in streams of
data such as newswire and broadcast news. The TDT2 corpus was
created to support three TDT2 tasks: find topically homogeneous
sections (segmentation), detect the occurrence of new events
(detection), and track the reoccurrence of old or new events
(tracking).
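
As a rough illustration of the audio format described above, the
following Python sketch reads the plain-text header of a NIST SPHERE
file and checks it against the stated properties (one channel, 16 kHz,
16-bit samples). It assumes the usual SPHERE layout: a "NIST_1A" line,
a header-size line, "name -type value" fields, and a closing
"end_head". The sample header bytes and the function names are
hypothetical and are not taken from the corpus itself; files on the
actual discs may differ (for example, they may be compressed).

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the ASCII header at the start of a NIST SPHERE file."""
    # The first 16 bytes hold the magic string and the header size.
    magic, size_text = raw[:16].decode("ascii").split("\n")[:2]
    if magic != "NIST_1A":
        raise ValueError("not a SPHERE header")
    header_size = int(size_text.strip())

    text = raw[:header_size].decode("ascii", errors="replace")
    fields = {"header_size": header_size}
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, kind, value = parts
        if kind == "-i":        # integer field
            fields[name] = int(value)
        elif kind == "-r":      # real-valued field
            fields[name] = float(value)
        else:                   # string field, e.g. -s3
            fields[name] = value
    return fields


def matches_announcement(fields: dict) -> bool:
    """True if the header describes single-channel, 16 kHz, 16-bit audio."""
    return (
        fields.get("channel_count") == 1
        and fields.get("sample_rate") == 16000
        and fields.get("sample_n_bytes") == 2
    )


# Hypothetical header, padded to the declared 1024 bytes.
SAMPLE_HEADER = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 1\n"
    b"sample_rate -i 16000\n"
    b"sample_n_bytes -i 2\n"
    b"sample_coding -s3 pcm\n"
    b"end_head\n"
).ljust(1024, b" ")

fields = parse_sphere_header(SAMPLE_HEADER)
print(fields["sample_rate"], matches_announcement(fields))
```

For a real file, the same function could be given the first 1024
bytes read in binary mode; the audio samples follow the header.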

TREC Mandarin
http://www.ldc.upenn.edu/Catalog/LDC2000T52.html
AGREEMENT:  http://www.ldc.upenn.edu/Catalog/mem_agree/trec_mandarin.html
$200 for nonmembers

This publication contains the TREC (Text REtrieval Conference)
Mandarin Corpus used for the Chinese task in TRECs 5-6 and
consists of approximately 170 megabytes of articles drawn from
the People's Daily newspaper (1991-1993) and the Xinhua newswire
(1994-1995) formatted to include TREC document ids. The text is
Mandarin Chinese and is encoded using the GB encoding scheme. The
topics (questions) and relevance judgments (right answers) are
not included in this publication but can be downloaded from the
Data/Non-English section of the TREC web site.
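
For readers unfamiliar with GB-encoded text, here is a minimal Python
sketch of decoding such data and listing TREC document ids. It assumes
that "GB" means GB 2312 (Python's "gb2312" codec; "gb18030" is a
superset that could be substituted) and that documents use the common
TREC SGML layout with <DOC> and <DOCNO> tags. Both points are
assumptions rather than details confirmed by this announcement, and
the sample document and its id are made up.

```python
import re

# Assumed TREC-style markup: <DOCNO> some-id </DOCNO>
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.IGNORECASE)


def decode_gb(raw: bytes) -> str:
    """Decode GB-encoded bytes, replacing any undecodable sequences."""
    return raw.decode("gb2312", errors="replace")


def list_docnos(raw: bytes) -> list:
    """Return the document ids found in a GB-encoded TREC-style file."""
    return DOCNO_RE.findall(decode_gb(raw))


# Hypothetical document; "\u4e2d\u6587" is the word for "Chinese".
SAMPLE = (
    "<DOC>\n"
    "<DOCNO> sample-0001 </DOCNO>\n"
    "<TEXT>\n\u4e2d\u6587\n</TEXT>\n"
    "</DOC>\n"
).encode("gb2312")

print(list_docnos(SAMPLE))
```

Decoding to Unicode first keeps the pattern matching independent of
the byte-level encoding.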

This collection of text was originally gathered by the Linguistic
Data Consortium (LDC), and then adapted by the National Institute
of Standards and Technology (NIST) for use in the TREC Mandarin
evaluation program.

If you would like to order a copy of these corpora, please email
your request to <ldc at unagi.cis.upenn.edu>.  If you need
additional information before placing your order, or would like
to inquire about membership in the LDC, please send email or
call (215) 573-1275.

Further information about the LDC and its available corpora can
be accessed on the Linguistic Data Consortium WWW Home Page at
URL: http://www.ldc.upenn.edu/


