[Corpora-List] CNN Transcripts

David Graff graff at ldc.upenn.edu
Wed Nov 16 18:33:55 UTC 2005


To clarify about the LDC's releases of CNN transcripts, there are actually
several corpora currently available, all of which have distinct,
non-overlapping content:

LDC97T22        1996 English Broadcast News Transcripts (Hub-4)
LDC98T28        1997 English Broadcast News Transcripts (Hub-4)
LDC98T31        1996 CSR Hub-4 Language Model
LDC2001T57      TDT2 Multilanguage Text Version 4.0
LDC2001T58      TDT3 Multilanguage Text Version 2.0
LDC2005T16      TDT4 Multilingual Text and Annotations

The two "Broadcast News Transcripts (Hub-4)" corpora were transcribed
manually from various CNN programs recorded in 1996 and 1997; these corpora
also include manual transcripts from other network news broadcasts (ABC,
CSPAN, PBS, etc.), for a total of about 200 hours of audio.

The "Hub-4 Language Model" comprises a large archive of older transcripts 
(obtained from a commercial archive, "Primary Source Media"), spanning 
Jan. 1992 - April 1996; again, CNN programs are included along with 
transcripts from numerous other broadcast news sources.

The TDT corpora have data drawn from CNN Headline News (not any other CNN 
programming), in the form of closed-caption texts captured from the 
broadcasts; other network sources are included, covering thousands of 
hours of audio.  The TDT corpora also include newswire text data.

Regarding the two corpora cited by Mark Davies:

 - LDC98T25 was actually a "pilot" corpus for the first phase of the TDT
project (Topic Detection and Tracking); it contains a subset of the CNN data
from the "Hub-4 Language Model" collection.

 - LDC2003T11 is a corpus annotated specifically for the "ACE" project
(Automatic Content Extraction); it contains a subset of the TDT2 corpus.


-----------
David Graff			Linguistic Data Consortium
graff at ldc.upenn.edu		3600 Market St., Suite 810
University of Pennsylvania	Philadelphia, PA 19104
		http://www.ldc.upenn.edu


Mark_Davies at byu.edu said:
> I'm also aware of some LDC Corpora that contain CNN transcripts, but in
> general these appear to be either from the newspaper or from scripted
> news broadcasts, e.g.:
>
> http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98T25
> http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T11
>
> At any rate, even though the genre/register of these transcripts is
> fairly homogenous, they do contain more than 170 million words of
> unscripted spoken English, so it seems like it might be a nice resource.
>
> Thanks in advance for any information that you might have. 


