[Corpora-List] CNN Transcripts
Stephanie M. Strassel
strassel at ldc.upenn.edu
Wed Nov 16 19:03:39 UTC 2005
Several new LDC corpora currently under development include transcripts
harvested from CNN and other broadcaster websites (per the data
licensing agreements we have negotiated with the copyright holders).
Previous LDC corpora containing CNN material use transcripts derived
from closed-captioning or, in some cases, manually created transcripts.
The CNN transcript archive is particularly nice because in most cases
its contents are verbatim transcripts with speaker attribution, not scripts
or summaries. Most data providers that feature "transcripts" on their
sites offer scripts rather than full transcripts.
Stephanie
Mark Davies wrote:
> Has anyone here done much with the CNN transcripts:
> http://transcripts.cnn.com/TRANSCRIPTS/ ?
>
> I'm aware of one publication (below), but would be interested in others
> as well:
>
> Hoffmann, Sebastian. "From Web-Page to Mega-Corpus: The CNN
> Transcripts." In: Marianne Hundt, Nadja Nesselhauf and Carolin Biewer
> (eds.) Corpus Linguistics and the Web. Amsterdam: Rodopi.
>
> I'm also aware of some LDC Corpora that contain CNN transcripts, but in
> general these appear to be either from the newspaper or from scripted
> news broadcasts, e.g.:
>
> http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98T25
> http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T11
>
> At any rate, even though the genre/register of these transcripts is
> fairly homogenous, they do contain more than 170 million words of
> unscripted spoken English, so it seems like it might be a nice resource.
>
> Thanks in advance for any information that you might have.
>
> Mark Davies
>
> =================================================
>
> Mark Davies
> Assoc. Prof., Linguistics
> Brigham Young University
> (phone) 801-422-9168 / (fax) 801-422-0906
>
> http://davies-linguistics.byu.edu
>
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
>
> =================================================
--
Stephanie Strassel
Associate Director, Annotation Research & Program Coordination
Linguistic Data Consortium
3600 Market Street, Suite 810 Philadelphia, PA 19104-2653 USA
phone: 215-898-9681, fax: 215-573-2175
strassel at ldc.upenn.edu
http://www.ldc.upenn.edu