[Corpora-List] New from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Thu Aug 31 19:22:53 UTC 2006


LDC2006S42
*Korean Broadcast News Speech 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S42>*

LDC2006T14
*Korean Broadcast News Transcripts 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T14>*

LDC2006S36
*West Point Korean Speech 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S36>*

The Linguistic Data Consortium (LDC) is pleased to announce the 
availability of three new publications.

------------------------------------------------------------------------


(1)  Korean Broadcast News Speech 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S42> 
consists of 18 audio files recorded by LDC in January 2000 and February 
2000 from Voice of America (VOA) satellite radio news broadcasts in 
Korean.  The recordings, captured from a dedicated satellite receiver, 
are stored in NIST SPHERE format as single-channel 16-bit PCM sampled at 16 kHz. 
The duration of each recording is either 30 minutes or 60 minutes, 
depending on the VOA broadcast schedule; the date (YYYYMMDD), start-time 
and end-time (HHMM, Eastern Standard Time) for each recording are 
indicated in the file names. The sample data are not compressed.
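
Because the sample data are uncompressed, the approximate size of each recording follows directly from the audio parameters given above. The short Python sketch below works through that arithmetic. The function name `pcm_bytes` is hypothetical, the figures ignore the SPHERE file header, and they assume each recording runs exactly 30 or 60 minutes, which the actual files may not.

```python
# Rough size estimate for uncompressed single-channel 16-bit PCM at 16 kHz.
# Assumption: the recording lasts exactly the nominal number of minutes,
# and the SPHERE header is not counted.
SAMPLE_RATE_HZ = 16_000   # 16 kHz sampling rate
BYTES_PER_SAMPLE = 2      # 16-bit samples
CHANNELS = 1              # single channel


def pcm_bytes(minutes: int) -> int:
    """Return the number of bytes of raw sample data for a recording."""
    seconds = minutes * 60
    return seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS


if __name__ == "__main__":
    for minutes in (30, 60):
        size = pcm_bytes(minutes)
        print(f"{minutes} min: {size:,} bytes (~{size / 1e6:.1f} MB)")
```

Under these assumptions a 30-minute recording holds about 57.6 MB of sample data and a 60-minute recording about 115.2 MB.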

Transcripts for these recordings are available as a separate corpus from 
the LDC: Korean Broadcast News Transcripts, LDC2006T14.

(2)  Korean Broadcast News Transcripts 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T14> 
consists of 18 text files containing transcripts prepared by the LDC for 
Voice of America satellite radio news broadcasts in Korean. The 
broadcasts were recorded by the LDC at transmission time during a 
two-week period between January 21, 2000 and February 7, 2000.  Nine of the 
broadcasts are 30 minutes long, and the other nine broadcasts are 60 
minutes long. The file names indicate the date (YYYYMMDD) and the begin 
and end times (HHMM EST) of the original transmission.

The character encoding is Unicode UTF-8, and the file contents are 
structured using SGML. The markup strategy used here was defined by NIST 
specifically for use in transcripts of broadcast news speech. The "docs" 
directory provides a working DTD file, a complete description (in the 
form of a PostScript file) of the document structure, tags and 
attributes, and a simple text file listing the 18 data file names in the 
corpus.

The transcripts have been manually time-aligned at the phrasal level and 
annotated to identify boundaries between news stories and speaker turns; 
speaker names and gender are given where identifiable. These annotations 
are all provided via the SGML tags and their attributes.  A strong 
effort has been made to identify all unique speakers across the 
transcripts. However, there may be cases where an individual speaker has 
not been recognized and has been given a unique, anonymous identification.

Audio files for these transcripts are available as a separate corpus 
from the LDC: Korean Broadcast News Speech, LDC2006S42. 

(3)  West Point Korean Speech 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S36> 
contains digital recordings of spoken Korean. Corpus design and data 
collection were carried out by staff and faculty of the Department of 
Foreign Languages (DFL) and Center for Technology Enhanced Language 
Learning (CTELL), located at the United States Military Academy (USMA), 
West Point, New York. The corpus was designed to support the development 
of speech recognition systems for use by the US government in 
speech-recognition-enhanced language learning courseware.

The prompt scripts were created from 20,000 distinct sentences, along 
with a subset of prompts designed to elicit free-response answers to 
questions for use in domain-specific speech-to-speech translation 
systems. Each speaker attempted to record 100 utterances. 

------------------------------------------------------------------------

If you need further information, or would like to inquire about 
membership in the LDC, please email ldc at ldc.upenn.edu or call +1 215 573 
1275.



--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                  http://www.ldc.upenn.edu
