[Corpora-List] Needed: Corpora of radio news segments.

Linguistic Data Consortium ldc at ldc.upenn.edu
Wed Nov 9 17:36:08 UTC 2005


Hi John,

You might wish to consider the following HUB4 and TDT resources 
distributed by the LDC.  These data sets contain substantial quantities 
of recent broadcast news in several languages, segmented into individual 
stories and time-aligned with verbatim transcripts.

LDC97S66    1996 English Broadcast News Dev and Eval (Hub-4)
LDC97S44    1996 English Broadcast News Speech (Hub-4)
LDC97T22    1996 English Broadcast News Transcripts (Hub-4)
LDC98S71    1997 English Broadcast News Speech (Hub-4)
LDC98T28    1997 English Broadcast News Transcripts (Hub-4)

LDC2002S11  1997 HUB4 English Evaluation Speech and Transcripts
LDC98S73    1997 Mandarin Broadcast News Speech (Hub-4NE)
LDC98T24    1997 Mandarin Broadcast News Transcripts (Hub-4NE)
LDC98S74    1997 Spanish Broadcast News Speech (Hub-4NE)
LDC98T29    1997 Spanish Broadcast News Transcripts (Hub-4NE)
LDC2000S86  1998 HUB-4 Broadcast News Evaluation English Test Material

LDC2000S92  TDT2 Careful Transcription Audio
LDC2000T44  TDT2 Careful Transcription Text
LDC99S84    TDT2 English Audio
LDC2001S93  TDT2 Mandarin Audio Corpus
LDC2001T57  TDT2 Multilanguage Text Version 4.0
LDC2001S94  TDT3 English Audio
LDC2001S95  TDT3 Mandarin Audio
LDC2001T58  TDT3 Multilanguage Text Version 2.0
LDC2005S11  TDT4 Multilingual Broadcast News Speech Corpus
LDC2005T16  TDT4 Multilingual Text and Annotations


You can view our entire online catalog at:

http://www.ldc.upenn.edu/Catalog/

Kind regards,

Ilya

Bryar Family wrote:

>Hello:
>
>I'm developing a project for rapid identification and categorization of
>audio news clips, with a "target communities" focus. Are there any public
>corpora available that consist of individual audio news stories of recent
>vintage (the last 5-10 years)?
>
>I'd also be interested in corresponding with any members of the list who are
>developing content categorization strategies for such audio content. For
>example, if there are any members of the list who are involved with the
>NewsML project, I'd like to hear from them. 
>
>John V "Jack" Bryar
>Managing Partner and acting CTO,
>MilkBottleNews Partners
>Direct: 802-843-6033
>jack at milkbottlenews.com
>

-- 


Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                    Phone: (215) 573-1275
University of Pennsylvania                    Fax:   (215) 573-2175
3600 Market St., Suite 810                        ldc at ldc.upenn.edu
Philadelphia, PA 19104                     http://www.ldc.upenn.edu
