[Corpora-List] CNN Transcripts

Mark Davies Mark_Davies at byu.edu
Wed Nov 16 17:31:10 UTC 2005


Has anyone here done much work with the CNN transcripts?
http://transcripts.cnn.com/TRANSCRIPTS/

I'm aware of one publication (below), but would be interested in others
as well:

Hoffmann, Sebastian. "From Web-Page to Mega-Corpus: The CNN
Transcripts." In: Marianne Hundt, Nadja Nesselhauf and Carolin Biewer
(eds.) Corpus Linguistics and the Web. Amsterdam: Rodopi.

I'm also aware of some LDC corpora that contain CNN transcripts, but in
general these appear to come either from newspapers or from scripted
news broadcasts, e.g.:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98T25
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T11

At any rate, even though the genre/register of these transcripts is
fairly homogeneous, they do contain more than 170 million words of
unscripted spoken English, so they seem like they might be a nice resource.

Thanks in advance for any information that you might have.

Mark Davies

=================================================

Mark Davies
Assoc. Prof., Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906

http://davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **

================================================= 


