<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
</head>
<body bgcolor="#ffffff" text="#000000">
<p style="text-align: center;" align="center">LDC2009S02<br>
- <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S02"><b>Czech
Broadcast Conversation Speech</b></a> -</p>
<p style="text-align: center;" align="center">LDC2009T20<b><br>
- <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T20">Czech
Broadcast Conversation MDE Transcripts</a> -</b></p>
<p style="text-align: center;" align="center">LDC2009T21<b><br>
- <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T21">Spanish
Gigaword Second Edition</a> -<br>
</b></p>
<p style="text-align: center;" align="center">The Linguistic Data
Consortium (LDC) would like to announce the availability of three new
publications.</p>
<div class="MsoNormal" style="text-align: center;" align="center">
<hr align="center" size="2" width="100%"></div>
<p class="MsoNormal" style="text-align: center;" align="center"><b>New
Publications</b></p>
<p>(1) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S02">Czech
Broadcast Conversation Speech</a> was prepared by researchers at the
University of West Bohemia, Pilsen, Czech Republic, and consists of
40 hours of speech from Radioforum, a talk show broadcast on Czech
Radio 1. Transcripts corresponding to the audio files in this corpus
are provided in <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T20">Czech
Broadcast Conversation MDE Transcripts (LDC2009T20)</a>.</p>
<p>Czech Broadcast Conversation Speech consists of 72 single-channel
recordings of Radioforum, a live talk program broadcast by <a
href="http://www.rozhlas.cz/radiozurnal/portal/">Czech Radio 1 (CRo1)</a>
every weekday evening. In each program, invited guests respond
spontaneously to topical questions posed by one or two interviewers.
A program features one to three interviewees; typically, one
interviewer and two interviewees appear. The material includes
passages of interactive dialogue, but longer stretches of
monologue-like speech make up the majority of the collected data.
Radioforum also has an interactive segment in which listeners call the
studio and ask their own questions; this telephone speech is not
transcribed in the current release.</p>
<p>Individual recordings range from 27 to 36 minutes in length. The
recordings were collected from February 12, 2003 through
June 26, 2003. The signal is mono, sampled at 22.05 kHz with 16-bit
resolution, and stored in Windows PCM waveform (WAV) format. The names
of the audio files encode the broadcast date (rfYYMMDD.wav).</p>
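<p>As an illustration of the naming convention and signal format
described above, the following is a minimal Python sketch, not part of
the release. It recovers the broadcast date from an rfYYMMDD.wav name
and checks a file for mono, 22.05 kHz, 16-bit audio. The function names
and the example file name rf030212.wav are hypothetical, and the sketch
assumes the two-digit year refers to the 2000s.</p>

```python
import re
import wave
from datetime import date, datetime

# Pattern for audio file names of the form rfYYMMDD.wav.
AUDIO_NAME = re.compile(r"^rf(\d{6})\.wav$")


def broadcast_date(filename: str) -> date:
    """Return the broadcast date encoded in an rfYYMMDD.wav file name."""
    match = AUDIO_NAME.match(filename)
    if match is None:
        raise ValueError(f"not an rfYYMMDD.wav name: {filename!r}")
    # %y maps two-digit years 00-68 to 2000-2068, which covers 2003.
    return datetime.strptime(match.group(1), "%y%m%d").date()


def has_documented_format(path: str) -> bool:
    """Check for mono, 22.05 kHz, 16-bit audio as described above."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == 1
            and wav.getframerate() == 22050
            and wav.getsampwidth() == 2  # 2 bytes per sample = 16 bits
        )


if __name__ == "__main__":
    # Hypothetical file name built from the first collection date.
    print(broadcast_date("rf030212.wav"))
```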
<p class="MsoNormal" style="text-align: center;" align="center">*</p>
<p>(2) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T20">Czech
Broadcast Conversation MDE Transcripts</a> was prepared by researchers
at the University of West Bohemia, Pilsen, Czech Republic, and consists
of approximately 33 hours of transcribed speech from Radioforum, a talk
show broadcast on Czech Radio 1. The audio files corresponding to the
transcripts in this corpus are contained in <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S02">Czech
Broadcast Conversation Speech (LDC2009S02)</a>.</p>
<p><span style="color: black;">Czech Broadcast Conversation MDE
Transcripts</span>
was created to extend Metadata Extraction (MDE) research to
conversational Czech. The goal of MDE is to take raw speech recognition
output and refine it into forms that are of more use to humans and to
downstream automatic processes. In simple terms, this means creating
automatic transcripts that are maximally readable. This readability
might be achieved in a number of ways: removing non-content words such
as filled pauses and discourse markers from the text; removing sections
of disfluent speech; and marking boundaries at natural breakpoints in
the flow of speech so that each sentence or other meaningful unit of
speech can be presented on a separate line of the resulting transcript.
Natural capitalization, punctuation, and standardized spelling, plus
sensible conventions for representing speaker turns and identity, are
further elements of a readable transcript.</p>
<p>The transcripts and annotations in this corpus are stored in three
different
formats: TRS (<a href="http://trans.sourceforge.net">Transcriber</a>),
QAn (<a href="http://www.mde.zcu.cz/qan.html">Quick Annotator</a>), and
RTTM. TRS
represents a standard speech transcript. QAn and RTTM also contain
information
about structural metadata (MDE). Character encoding in all files is
ISO-8859-2.</p>
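<p>Because the files are encoded in ISO-8859-2 rather than UTF-8, they
need to be decoded explicitly. The following is a minimal Python sketch
of one way to do that; the function names are hypothetical and not part
of the release.</p>

```python
from pathlib import Path

# Character encoding stated for all transcript and annotation files.
CORPUS_ENCODING = "iso-8859-2"


def decode_transcript(raw: bytes) -> str:
    """Decode raw file bytes using the corpus encoding."""
    return raw.decode(CORPUS_ENCODING)


def read_transcript(path: str) -> str:
    """Read a .trs, .qan, or .rttm file as text."""
    return Path(path).read_text(encoding=CORPUS_ENCODING)


def convert_to_utf8(src: str, dst: str) -> None:
    """Write a UTF-8 copy of a corpus file, leaving the original intact."""
    Path(dst).write_text(read_transcript(src), encoding="utf-8")
```

<p>Reading such a file with a UTF-8 decoder would typically fail or
garble Czech letters with diacritics, so passing the encoding explicitly
is the safer choice.</p>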
<p>All filenames have the form rfYYMMDD.format, where "rf" stands for
Radioforum, the following six digits indicate the date of broadcast,
and the extension ".format" corresponds to the data format of the
particular file: ".trs", ".qan", or ".rttm".</p>
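<p>The naming scheme makes it straightforward to collect the files that
belong to one broadcast. The following Python sketch groups file names
by their rfYYMMDD stem and reports which formats are absent. It assumes,
as an illustration only, that every broadcast is expected to have all
three formats; the function names and example file names are
hypothetical.</p>

```python
import re
from collections import defaultdict

# Matches rfYYMMDD.trs, rfYYMMDD.qan, and rfYYMMDD.rttm.
TRANSCRIPT_NAME = re.compile(r"^(rf\d{6})\.(trs|qan|rttm)$")
# Assumption: each broadcast should appear in all three formats.
EXPECTED_FORMATS = {"trs", "qan", "rttm"}


def group_by_broadcast(filenames):
    """Map each rfYYMMDD stem to the set of formats found for it."""
    groups = defaultdict(set)
    for name in filenames:
        match = TRANSCRIPT_NAME.match(name)
        if match:  # names that do not follow the scheme are skipped
            groups[match.group(1)].add(match.group(2))
    return dict(groups)


def missing_formats(groups):
    """Return, per broadcast, the expected formats that were not found."""
    return {
        stem: EXPECTED_FORMATS - found
        for stem, found in groups.items()
        if found != EXPECTED_FORMATS
    }


if __name__ == "__main__":
    # Hypothetical directory listing.
    names = ["rf030212.trs", "rf030212.qan", "rf030212.rttm", "rf030213.trs"]
    print(missing_formats(group_by_broadcast(names)))
```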
<p class="MsoNormal" style="text-align: center;" align="center">*</p>
<p>(3) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T21">Spanish
Gigaword Second Edition</a> is a comprehensive archive of newswire text
data that has been acquired over several years by LDC. This second
edition updates <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T12">Spanish
Gigaword First Edition (LDC2006T12)</a> and adds data collected from
January 1, 2006 through December 31, 2008.</p>
<p>The three distinct international sources of Spanish newswire in this
edition, and the time spans of collection covered for each, are as
follows:</p>
<ul type="disc">
<li class="MsoNormal" style="">Agence France-Presse, Spanish Service
(afp_spa): May 1994 - Dec 2008</li>
<li class="MsoNormal" style="">Associated Press Worldstream, Spanish
(apw_spa): Nov 1993 - Dec 2008</li>
<li class="MsoNormal" style="">Xinhua News Agency, Spanish Service
(xin_spa): Sep 2001 - Dec 2008</li>
</ul>
<p>Each seven-character code in parentheses above consists of a
three-character source name abbreviation and the three-character
language code ("spa"), separated by an underscore ("_") character. The
three-letter language code conforms to LDC's internal convention based
on the ISO 639-3 standard. These codes are used in the names of the
directories where the data files are found and as the prefix of every
data file name. They are also used (in upper case) as the initial
portion of the DOC "id" strings that uniquely identify each news story.
</p>
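<p>The code convention can be illustrated with a short Python sketch
that splits a code into its source and language parts and identifies
the source from the start of a DOC "id" string. This is only a sketch:
the function names are hypothetical, and because the remainder of the
id format is not described here, the example id is a made-up
placeholder that shows only the upper-case prefix.</p>

```python
# Source codes and names as listed in this announcement.
SOURCES = {
    "afp_spa": "Agence France-Presse, Spanish Service",
    "apw_spa": "Associated Press Worldstream, Spanish",
    "xin_spa": "Xinhua News Agency, Spanish Service",
}


def split_code(code: str):
    """Split a code such as 'afp_spa' into (source, language)."""
    source, sep, language = code.partition("_")
    if not sep or len(source) != 3 or len(language) != 3:
        raise ValueError(f"not a seven-character source code: {code!r}")
    return source, language


def source_of_doc_id(doc_id: str):
    """Return the source code whose upper-case form starts doc_id, else None."""
    for code in SOURCES:
        if doc_id.startswith(code.upper()):
            return code
    return None


if __name__ == "__main__":
    print(split_code("afp_spa"))
    # Placeholder id: only the upper-case prefix reflects the convention.
    print(SOURCES[source_of_doc_id("XIN_SPA_placeholder")])
```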
<hr size="2" width="100%"><br>
<div align="center"><font face="Courier New, Courier, monospace"><small><small><big>Ilya
Ahtaridis<br>
Membership Coordinator</big><br>
<br>
</small>--------------------------------------------------------------------</small><small><br>
</small></font></div>
<div align="center">
<pre class="moz-signature" cols="72"><font
face="Courier New, Courier, monospace">Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 <a
class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a
class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>
</div>
<br>
</body>
</html>