[Corpora-List] Needed: Corpora of radio news segments.
Linguistic Data Consortium
ldc at ldc.upenn.edu
Wed Nov 9 17:36:08 UTC 2005
Hi John,
You might wish to consider the following HUB4 and TDT resources
distributed by the LDC. These data sets contain substantial quantities
of recent broadcast news in several languages, segmented into individual
stories and time-aligned with verbatim transcripts.
LDC97S66    1996 English Broadcast News Dev and Eval (Hub-4)
LDC97S44    1996 English Broadcast News Speech (Hub-4)
LDC97T22    1996 English Broadcast News Transcripts (Hub-4)
LDC98S71    1997 English Broadcast News Speech (Hub-4)
LDC98T28    1997 English Broadcast News Transcripts (Hub-4)
LDC2002S11  1997 HUB4 English Evaluation Speech and Transcripts
LDC98S73    1997 Mandarin Broadcast News Speech (Hub-4NE)
LDC98T24    1997 Mandarin Broadcast News Transcripts (Hub-4NE)
LDC98S74    1997 Spanish Broadcast News Speech (Hub-4NE)
LDC98T29    1997 Spanish Broadcast News Transcripts (Hub-4NE)
LDC2000S86  1998 HUB-4 Broadcast News Evaluation English Test Material
LDC2000S92  TDT2 Careful Transcription Audio
LDC2000T44  TDT2 Careful Transcription Text
LDC99S84    TDT2 English Audio
LDC2001S93  TDT2 Mandarin Audio Corpus
LDC2001T57  TDT2 Multilanguage Text Version 4.0
LDC2001S94  TDT3 English Audio
LDC2001S95  TDT3 Mandarin Audio
LDC2001T58  TDT3 Multilanguage Text Version 2.0
LDC2005S11  TDT4 Multilingual Broadcast News Speech Corpus
LDC2005T16  TDT4 Multilingual Text and Annotations
You can view our entire online catalog at:
http://www.ldc.upenn.edu/Catalog/
Kind regards,
Ilya
Bryar Family wrote:
>Hello:
>
>I'm developing a project for rapid identification and categorization of
>audio news clips, with a "target communities" focus. Are there any public
>corpora available that consist of individual audio news stories of recent
>vintage? (last 5-10 years)
>
>I'd also be interested in corresponding with any members of the list who are
>developing content categorization strategies for such audio content. For
>example, if there are any members of the list who are involved with the
>NewsML project, I'd like to hear from them.
>
>John V "Jack" Bryar
>Managing Partner and acting CTO,
>MilkBottleNews Partners
>Direct: 802-843-6033
>jack at milkbottlenews.com
>
>
>
--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium      Phone: (215) 573-1275
University of Pennsylvania      Fax:   (215) 573-2175
3600 Market St., Suite 810      ldc at ldc.upenn.edu
Philadelphia, PA 19104          http://www.ldc.upenn.edu