Corpora: New Releases from the LDC

LDC Office ldc at ldc.upenn.edu
Tue May 8 19:20:57 UTC 2001


The Linguistic Data Consortium (LDC) is pleased to announce the
release of three resources to support research in Topic Detection and
Tracking (TDT) and information retrieval:


1. TDT2 Multilanguage Text Corpus, version 4.0
   LDC2001T57, isbn 1-58563-183-3, 1 CD-ROM
   http://www.ldc.upenn.edu/Catalog/LDC2001T57.html
2. TDT3 Multilanguage Text Corpus, version 2.0
   LDC2001T58, isbn 1-58563-193-0, 1 CD-ROM
   http://www.ldc.upenn.edu/Catalog/LDC2001T58.html
3. TDT3 English Audio Corpus
   LDC2001S94, isbn 1-58563-185-x, 55 CD-ROMs
   http://www.ldc.upenn.edu/Catalog/LDC2001S94.html

You may refer to the LDC's online catalog pages for full
documentation: http://www.ldc.upenn.edu/Catalog/

Topic Detection and Tracking refers to automatic techniques for
finding topically related material in streams of data such as newswire
and broadcast news.  These corpora were created to support the TDT
tasks of finding topically homogeneous sections (segmentation),
detecting the occurrence of new events (detection), and tracking the
reoccurrence of old or new events (tracking).  Taken together, the
corpora contain audio of broadcast news; news texts, including
transcripts of all audio; and annotation tables indicating story
boundaries and the relevance of each story to news topics selected
from the collection.  The TDT corpora have also been used for
information retrieval, spoken document retrieval, and information
extraction.

For further information on TDT please visit:
http://www.ldc.upenn.edu/Projects/TDT.  Brief descriptions of each
corpus are provided below, with information on how to order them.

-------

1. TDT2 Multilanguage Text Corpus, Version 4.0 contains news
data collected daily from nine news sources in two languages (American
English and Mandarin Chinese), over a period of six months (January -
June, 1998).  Both manually-created reference text and automatically-
generated text (ASR and/or machine translation) are provided for all
broadcast and all Mandarin data.

This version has been prepared to complement the first general release
of the TDT3 Multilanguage Text Corpus, providing new enhancements to
make the data content more accessible to a broader research community.

The news sources, and approximate number of stories per source (in
thousands), are as follows:


English sources                              Thousands of stories
-----------------------------------------------------------------
 New York Times Newswire Service                  11.8
 Associated Press Worldstream Service             12.8
 Cable News Network, "Headline News"              15.8
 American Broadcasting Co., "World News Tonight"   2.1
 Public Radio International, "The World"           2.9
 Voice of America, English news programs           8.2
    Total English stories:                        53.6 thousand

Mandarin sources
-----------------------------------------------------------------
 Xinhua News Agency                               11.3
 Zaobao News Agency                                5.2
 Voice of America, Mandarin Chinese news programs  2.3
    Total Mandarin stories:                       18.8 thousand

Institutions that have membership in the LDC during the
2001 Membership Year will be able to receive this corpus
free of charge. The non-member cost is $2,500.

-------

2. TDT3 Multilanguage Text Corpus Version 2.0 is the first
general release of this collection (version 1 was made available only
to participants in the TDT 1999 and 2000 evaluation tests).  It
contains data from the same nine sources found in TDT2, plus two
additional English television sources.  Like TDT2, it provides both
manually-created and automatically-generated text for most sources.

For TDT3, the daily collection took place over a period of three
months (October - December, 1998).  The sources and approximate number
of stories per source are as follows:


English sources                              Thousands of stories
-----------------------------------------------------------------
 New York Times Newswire Service                  6.9
 Associated Press Worldstream Service             7.3
 Cable News Network, "Headline News"              9.0
 American Broadcasting Co., "World News Tonight"  1.0
 Public Radio International, "The World"          1.6
 Voice of America, English news programs          3.9
 MS-NBC, "News with Brian Williams"               0.7
 National Broadcasting Co., "NBC Nightly News"    0.8
    Total English stories:                       31.2 thousand

Mandarin sources
-----------------------------------------------------------------
 Xinhua News Agency                               5.2
 Zaobao News Agency                               3.8
 Voice of America, Mandarin Chinese news programs 3.8
    Total Mandarin stories:                      12.8 thousand

Institutions that have membership in the LDC during the
2001 Membership Year will be able to receive this corpus
free of charge.  The non-member cost is $2,500.

-------

3. TDT3 English Audio Corpus contains the audio (in compressed
SPHERE format) of news broadcasts collected daily from six news
sources in American English, over the three-month collection period
(October - December, 1998).  The sources and amounts are as follows:

ID       Source                                           Hours  CDs
--------------------------------------------------------------------
CNN_HDL  Cable News Network, "Headline News"              174.6   19
ABC_WNT  American Broadcasting Co., "World News Tonight"   38.6    5
NBC_NNW  National Broadcasting Co., "NBC Nightly News"     44.6    6
MNB_NBW  MS-NBC, "News with Brian Williams"                51.8    6
PRI_TWD  Public Radio International, "The World"           63.9    7
VOA_ENG  Voice of America, English news programs          102.2   12
--------------------------------------------------------------------
Total                                                     475.7   55

The files in this publication are complete single-channel recordings
of the (thirty- or sixty-minute) broadcasts listed above.  Each one has
been digitized at a sample rate of 16 kHz using 16-bit samples, and
compressed using the "shorten" algorithm.
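
As a rough check on these figures, the uncompressed size implied by the
stated format (single channel, 16,000 samples per second, 16 bits per
sample) can be estimated from the hours listed for each source.  The
Python sketch below is illustrative only: the constant and function
names are hypothetical, and it ignores SPHERE header overhead and the
compression ratio actually achieved by "shorten", neither of which is
stated in this announcement.

```python
# Audio parameters taken from the corpus description.
SAMPLE_RATE_HZ = 16_000   # 16 kHz sample rate
BYTES_PER_SAMPLE = 2      # 16-bit samples
CHANNELS = 1              # single-channel recordings

# Hours of audio per source, as listed in the table for this corpus.
HOURS_BY_SOURCE = {
    "CNN_HDL": 174.6,
    "ABC_WNT": 38.6,
    "NBC_NNW": 44.6,
    "MNB_NBW": 51.8,
    "PRI_TWD": 63.9,
    "VOA_ENG": 102.2,
}


def uncompressed_bytes(hours: float) -> int:
    """Estimate raw PCM size in bytes for the given duration (headers ignored)."""
    seconds = hours * 3600
    return round(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


total_hours = sum(HOURS_BY_SOURCE.values())
total_gb = uncompressed_bytes(total_hours) / 1e9  # decimal gigabytes

if __name__ == "__main__":
    for source, hours in HOURS_BY_SOURCE.items():
        print(f"{source}: {uncompressed_bytes(hours) / 1e9:.1f} GB uncompressed")
    print(f"Total: {total_hours:.1f} hours, about {total_gb:.1f} GB uncompressed")
```

Dividing the total by the 55 discs gives an average uncompressed
payload per disc, which can be compared against the capacity of the
media when planning storage for the decompressed audio.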

Institutions that have commercial membership in the LDC during the
2001 Membership Year will be able to receive this corpus free of
charge.  Institutions that have non-profit membership in the 2001
Membership Year will need to pay a media fee of $1,100 for the full
set of 55 CD-ROMs.  The non-member cost for the full set is $11,000.

(The audio CD-ROMs are grouped into subsets by broadcast source, and
the LDC will support the option of purchasing one or more subsets,
e.g. just the VOA data.  We regret that we cannot provide "customized"
subsets.)

If you would like to order a copy of any of these corpora, please
email your request to ldc@ldc.upenn.edu.  If you need
additional information before placing your order, or would like to
inquire about membership in the LDC, please send email or call
(215) 573-1275.


