[Corpora-List] CNN Transcripts

Wed Nov 16 19:03:39 UTC 2005

Several new LDC corpora currently under development include transcripts 
harvested from CNN and other broadcaster websites (per the data 
licensing agreements we have negotiated with the copyright holders). 
Previous LDC corpora containing CNN material use transcripts derived 
from closed captioning or, in some cases, manually created transcripts.

The CNN transcript archive is particularly nice because in most cases 
its contents are verbatim transcripts with speaker attribution, not scripts 
or summaries.  Most other data providers, if they feature "transcripts" on 
their sites at all, offer scripts rather than full transcripts.

Stephanie

Mark Davies wrote:
> Has anyone here done much with the CNN transcripts:
> http://transcripts.cnn.com/TRANSCRIPTS/ ? 
> 
> I'm aware of one publication (below), but would be interested in others
> as well:
> 
> Hoffmann, Sebastian. "From Web-Page to Mega-Corpus: The CNN
> Transcripts." In: Marianne Hundt, Nadja Nesselhauf and Carolin Biewer
> (eds.) Corpus Linguistics and the Web. Amsterdam: Rodopi.
> 
> I'm also aware of some LDC Corpora that contain CNN transcripts, but in
> general these appear to be either from the newspaper or from scripted
> news broadcasts, e.g.:
> 
> http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98T25
> http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T11
> 
> At any rate, even though the genre/register of these transcripts is
> fairly homogenous, they do contain more than 170 million words of
> unscripted spoken English, so it seems like it might be a nice resource.
> 
> Thanks in advance for any information that you might have.
> 
> Mark Davies
> 
> =================================================
> 
> Mark Davies
> Assoc. Prof., Linguistics
> Brigham Young University
> (phone) 801-422-9168 / (fax) 801-422-0906
> 
> http://davies-linguistics.byu.edu
> 
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
> 
> ================================================= 

-- 
Stephanie Strassel
Associate Director, Annotation Research & Program Coordination
Linguistic Data Consortium
3600 Market Street, Suite 810  Philadelphia, PA 19104-2653 USA
phone: 215-898-9681, fax: 215-573-2175
strassel at ldc.upenn.edu
http://www.ldc.upenn.edu