Corpora: Broadcast corpus

David Graff graff at unagi.cis.upenn.edu
Mon Jan 17 18:58:25 UTC 2000


I'd like to clarify the availability of corpora from the LDC for working with
topic-sensitive language modeling.  Of the corpora mentioned in earlier
messages (thanks Raman!):

           LDC98T31 1996 CSR Hub-4 Language Model
           LDC97T22 1996 English Broadcast News Transcripts (Hub-4)
           LDC98T28 1997 English Broadcast News Transcripts (Hub-4)
           LDC98T24 1997 Mandarin Broadcast News Transcripts (Hub-4NE)
           LDC98T29 1997 Spanish Broadcast News Transcripts (Hub-4NE)

           LDC99T36 USC Marketplace Broadcast News Transcripts

Only the first of these contains enough bulk to support experiments on
adaptive LMs: it contains a 4.5-year archive of broadcast transcripts
(1992/01 - 1996/06), originally derived from Primary Source Media (PSM).  The
LDC has not done any topic annotation on this collection, but the
SGML-formatted text files do preserve a variety of information about each
story that was supplied by PSM, including keywords, story titles and/or
headlines.
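
As a rough illustration of how the PSM-supplied keywords could stand in for
topic labels, here is a minimal Python sketch that groups story text by
keyword.  The tag names (STORY, KEYWORDS, TEXT) and the semicolon-separated
keyword format are assumptions made up for this example, not the actual
markup of the LDC files; check the corpus documentation for the real tags.

```python
import re
from collections import defaultdict

# Hypothetical markup: these tag names are assumptions for illustration,
# not the actual tags used in the LDC text files.
STORY_RE = re.compile(r"<STORY>(.*?)</STORY>", re.DOTALL)
KEYWORDS_RE = re.compile(r"<KEYWORDS>(.*?)</KEYWORDS>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def group_stories_by_keyword(sgml):
    """Map each lower-cased keyword to the list of story texts carrying it."""
    groups = defaultdict(list)
    for story in STORY_RE.findall(sgml):
        kw_match = KEYWORDS_RE.search(story)
        text_match = TEXT_RE.search(story)
        if not kw_match or not text_match:
            continue  # skip stories lacking keywords or body text
        # Collapse line breaks and runs of whitespace in the story body.
        text = " ".join(text_match.group(1).split())
        # Assumed format: keywords separated by semicolons.
        for kw in kw_match.group(1).split(";"):
            kw = kw.strip().lower()
            if kw:
                groups[kw].append(text)
    return dict(groups)


sample = """
<STORY><KEYWORDS>Economy; Trade</KEYWORDS>
<TEXT>Exports rose
last quarter.</TEXT></STORY>
<STORY><KEYWORDS>Economy</KEYWORDS>
<TEXT>Interest rates held steady.</TEXT></STORY>
"""
groups = group_stories_by_keyword(sample)
```

Each keyword's list of stories could then serve as training text for one
topic-specific LM component.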

Unfortunately, due to constraints imposed by copyright owners, all Hub-4
corpora (including the LM collection) are available only to LDC members.  (The
"USC Marketplace" transcripts are available to non-members, but account for
only about 40 hours' worth of broadcasts.)

Other corpora that might be useful for topic-based LM research are the TDT
collections:

	 LDC99T39 TDT2 Multilanguage Text
	 LDC99T38 TDT2 Mandarin Text
	 LDC99T37 TDT2 English Text
	 LDC98T25 TDT Pilot Study Corpus

The "Multilanguage" collection is simply the combination of the TDT2 Mandarin
and English collections.  These are available to non-members (please check our
catalog -- www.ldc.upenn.edu/Catalog).  TDT2 covers a time span of only six
months (1998/01-06), but it includes over 700 hours of English broadcasts
(and only about 60 hours of Mandarin broadcasts), plus an equivalent amount of
newswire data.  All English stories in the collection are labeled with respect
to 100 selected topics, and all Mandarin stories are labeled with respect to
20 topics (selected from the 100 topics defined for English).

	Dave Graff
	LDC