<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

<div align="center">LDC2006S42<br>

<b><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S42">Korean

Broadcast News Speech</a></b><br>

<br>

LDC2006T14<br>

<b><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T14">Korean

Broadcast News Transcripts</a></b><br>

<br>

LDC2006S36<br>

</div>

<div align="center"><b><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S36">West

Point Korean Speech</a><br>

<br>

</b></div>

<div align="center">

<div align="center">The Linguistic Data

Consortium (LDC) is pleased to announce the availability of three new

publications.<br>

</div>

<br>

</div>

<hr size="2" width="100%">

<div align="left"><br>

<br>

</div>

(1)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S42">Korean

Broadcast News Speech</a> consists of 18 audio files recorded by LDC in

January 2000 and February 2000 from Voice of America (VOA) satellite

radio news broadcasts in Korean.  The recordings, captured from a

dedicated satellite receiver, are stored as 16-bit, 16-kHz,

single-channel PCM in NIST SPHERE format. The duration of each recording

is either 30 minutes or 60 minutes, depending on the VOA broadcast

schedule; the date (YYYYMMDD), start time and end time (HHMM, Eastern

Standard Time) for each recording are indicated in the file names. The

sample data are not compressed.
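<br>

<br>

As a rough illustration of what these parameters imply for storage, the
following Python sketch estimates the size of the uncompressed sample data
for a 30-minute and a 60-minute recording. It is not part of the corpus;
the constant and function names are illustrative assumptions, and the
SPHERE header is ignored.

<pre>
```python
# Audio parameters as stated in the corpus description.
SAMPLE_RATE_HZ = 16_000   # 16-kHz sampling rate
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # single-channel


def pcm_bytes(minutes):
    """Return the uncompressed sample-data size in bytes (header excluded)."""
    seconds = minutes * 60
    return seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS


for minutes in (30, 60):
    size = pcm_bytes(minutes)
    print(f"{minutes} min: {size} bytes ({size / 1e6:.1f} MB)")
# 30 min: 57600000 bytes (57.6 MB)
# 60 min: 115200000 bytes (115.2 MB)
```
</pre>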

<br>

<br>

Transcripts for these recordings are available as a separate corpus

from the LDC: Korean Broadcast News Transcripts, LDC2006T14.<br>

<br>

<div align="center">*<br>

</div>

<br>

(2)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T14">Korean

Broadcast News Transcripts</a> consists of 18 text files containing

transcripts prepared by the LDC for Voice of America satellite radio

news broadcasts in Korean. The broadcasts were recorded by the LDC at

transmission time during a two-week period between January 21, 2000 and

February 7, 2000.  Nine of the broadcasts are 30 minutes long, and the

other nine broadcasts are 60 minutes long. The file names indicate the

date (YYYYMMDD) and the begin and end times (HHMM EST) of the original

transmission.

<br>

<br>

The character encoding is Unicode UTF-8, and the file contents are

structured using SGML. The markup strategy used here was defined by

NIST specifically for use in transcripts of broadcast news speech. The

"docs" directory provides a working DTD file, a complete description

(in the form of a PostScript file) of the document structure, tags and

attributes, and a simple text file listing the 18 data file names in

the corpus.

<br>

<br>

The transcripts have been manually time-aligned at the phrasal level

and annotated to identify boundaries between news stories and speaker

turns; speaker names and gender are given where identifiable. These

annotations are all provided via the SGML tags and their attributes.  A

strong effort has been made to identify all unique speakers across

the transcripts. However, there may be cases where an individual

speaker has not been recognized and has been given a unique, anonymous

identification.

<br>

<br>

Audio files for these transcripts are available as a separate corpus

from the LDC: Korean Broadcast News Speech, LDC2006S42.  <br>

<br>

<div align="center">*<br>

</div>

<br>

(3)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S36">West

Point Korean Speech</a> contains digital recordings of spoken Korean.

Corpus design and data collection were carried out by staff and faculty

of the Department of Foreign Languages (DFL) and Center for Technology

Enhanced Language Learning (CTELL), located at the United States

Military Academy (USMA), West Point, New York. The corpus was designed

to support the development of speech recognition systems for use by the

US government in speech-recognition-enhanced language-learning

courseware.

<br>

<br>

The prompt scripts were created from 20,000 distinct sentences, along

with a subset of prompts designed to elicit free-response answers to

questions for use in domain-specific speech-to-speech translation

systems. Each speaker attempted to record 100 utterances.  <br>

<br>

<hr size="2" width="100%"><br>

<div align="center"><font face="Courier New"><small><big><font

 face="Times New Roman">If

you need further

information, or would like to inquire about

membership in the LDC, please email <a class="moz-txt-link-abbreviated"

 href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a> or call +1 215

573 1275.</font></big></small></font><br>

</div>


<div align="center">--------------------------------------------------------------------<br>

</div>

<div align="center">

<pre class="moz-signature" cols="72">Linguistic Data Consortium                     Phone: (215) 573-1275

University of Pennsylvania                       Fax: (215) 573-2175

3600 Market St., Suite 810                         <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104 USA                  <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></pre>

</div>

</body>

</html>