<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
<div align="center">LDC2006S42<br>
<b><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S42">Korean
Broadcast News Speech</a></b><br>
<br>
LDC2006T14<br>
<b><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T14">Korean
Broadcast News Transcripts</a></b><br>
<br>
LDC2006S36<br>
</div>
<div align="center"><b><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S36">West
Point Korean Speech</a><br>
<br>
</b></div>
<div align="center">
<div align="center">The Linguistic Data
Consortium (LDC) is pleased to announce the availability of three new
publications.<br>
</div>
<br>
</div>
<hr size="2" width="100%">
<div align="left"><br>
<br>
</div>
(1) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S42">Korean
Broadcast News Speech</a> consists of 18 audio files recorded by the LDC in
January 2000 and February 2000 from Voice of America (VOA) satellite
radio news broadcasts in Korean. The recordings, captured from a
dedicated satellite receiver, are stored as 16-bit, 16-kHz,
single-channel PCM audio in NIST SPHERE format. The duration of each recording
is either 30 minutes or 60 minutes, depending on the VOA broadcast
schedule; the date (YYYYMMDD), start-time and end-time (HHMM, Eastern
Standard Time) for each recording are indicated in the file names. The
sample data are not compressed.
<br>
<br>
Transcripts for these recordings are available as a separate corpus
from the LDC: Korean Broadcast News Transcripts, LDC2006T14.<br>
<br>
<div align="center">*<br>
</div>
<br>
(2) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T14">Korean
Broadcast News Transcripts</a> consists of 18 text files containing
transcripts prepared by the LDC for Voice of America satellite radio
news broadcasts in Korean. The broadcasts were recorded by the LDC at
transmission time during a two-week period between January 21, 2000 and
February 7, 2000. Nine of the broadcasts are 30 minutes long, and the
other nine broadcasts are 60 minutes long. The file names indicate the
date (YYYYMMDD) and the start and end times (HHMM EST) of the original
transmission.
<br>
<br>
The character encoding is Unicode UTF-8, and the file contents are
structured using SGML. The markup strategy used here was defined by
NIST specifically for use in transcripts of broadcast news speech. The
"docs" directory provides a working DTD file, a complete description
(in the form of a PostScript file) of the document structure, tags and
attributes, and a simple text file listing the 18 data file names in
the corpus.
<br>
<br>
The transcripts have been manually time-aligned at the phrasal level
and annotated to identify boundaries between news stories and speaker
turns; speaker names and gender are given where identifiable. These
annotations are all provided via the SGML tags and their attributes. A
strong effort has been made to identify all unique speakers across
the transcripts. However, there may be cases where an individual
speaker has not been recognized and has been given a unique, anonymous
identification.
<br>
<br>
Audio files for these transcripts are available as a separate corpus
from the LDC: Korean Broadcast News Speech, LDC2006S42. <br>
<br>
<div align="center">*<br>
</div>
<br>
(3) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S36">West
Point Korean Speech</a> contains digital recordings of spoken Korean.
Corpus design and data collection were carried out by staff and faculty
of the Department of Foreign Languages (DFL) and Center for Technology
Enhanced Language Learning (CTELL), located at the United States
Military Academy (USMA), West Point, New York. The corpus was designed
to support the development of speech recognition systems that would be used by the US
government for speech-recognition-enhanced language learning courseware.
<br>
<br>
The prompt scripts were created from 20,000 distinct sentences, along
with a subset of prompts designed to elicit free response answers to
questions for use in domain-specific speech-to-speech translation
systems. Each speaker attempted to record 100 utterances. <br>
<br>
<hr size="2" width="100%"><br>
<div align="center"><font face="Courier New"><small><big><font
face="Times New Roman">If
you need further
information, or would like to inquire about
membership in the LDC, please email <a class="moz-txt-link-abbreviated"
href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a> or call +1 215
573 1275.</font></big></small></font><br>
</div>
<div align="center">--------------------------------------------------------------------<br>
</div>
<div align="center">
<pre class="moz-signature" cols="72">Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 <a
class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a
class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></pre>
</div>
</body>
</html>