<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
</head>
<body bgcolor="#ffffff" text="#000000">
<p align="center">LDC2009S01<br>
<big><font face="Times New Roman, Times, serif" size="2"><big>- <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S01"><b>CSLU:
Numbers Version 1.3</b></a> -<br>
</big></font></big></p>
<p align="center"><big><font face="Times New Roman, Times, serif"
size="2"><big> LDC2009T01</big></font></big><br>
<b>- <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T01">English
CTS Treebank with Structural Metadata</a> -<br>
</b></p>
<p align="center">LDC2009T02<br>
<b>- <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T02">GALE
Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1</a> -<br>
</b></p>
<p align="center">The Linguistic Data
Consortium (LDC) would like to
announce the availability of three new publications.<b><br>
</b></p>
<hr size="2" width="100%">
<p style="margin-bottom: 12pt; text-align: center;" align="center"><b>New
Publications<br>
</b></p>
<p>(1) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S01">CSLU:
Numbers Version 1.3</a> was created by the Center for Spoken Language
Understanding (CSLU) at OGI School of Science and Engineering, Oregon
Health
and Science University,
Beaverton, Oregon.
It is a collection of naturally produced numbers taken from utterances
in
various CSLU telephone speech data collections. The corpus consists of
approximately fifteen hours of speech and includes isolated digit
strings,
continuous digit strings, and ordinal/cardinal numbers.</p>
<p>The numbers come from several sources, including phone numbers,
numbers from
street addresses, and zip codes, and were uttered by 12,618 speakers in a total of
23,902
files. In most of CSLU's telephone data collections, callers were asked
for
their phone number, date of birth, or zip code. Callers would also
occasionally
leave numbers in the midst of another utterance. The numbers in those
situations were extracted from the host utterance and added to the
corpus.</p>
<p>Each file includes an orthographic transcription following the CSLU
Labeling
guidelines, which are included in the documentation for this
publication. Also,
many of the utterances have been phonetically labeled.</p>
<p class="MsoNormal" style="text-align: center;" align="center">*</p>
<p class="MsoNormal"><br>
(2) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T01">English
CTS Treebank with Structural Metadata</a> consists of metadata and
syntactic
structure annotations for 144 English telephone conversations, or
140,000
words, from data used in the <a
href="http://projects.ldc.upenn.edu/EARS/">EARS
(Effective, Affordable, Reusable Speech-to-Text) program</a>. English
CTS
Treebank with Structural Metadata was created to support EARS work in
English.
It applies EARS metadata extraction annotations and Penn Treebank
methods to
conversations from <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S62">Switchboard-1
Release 2 (LDC97S62)</a> and from data collected for EARS under the
Fisher
Protocol.<br>
<br>
The purpose of the EARS program was to develop robust speech
recognition
technology to address a range of languages and speaking styles. LDC
provided conversational
and broadcast speech and transcripts, annotations, lexicons and texts
for
language modeling in each of the EARS languages (Arabic, Chinese,
English). LDC
also supported a <a href="http://projects.ldc.upenn.edu/MDE">metadata
extraction
(MDE) research evaluation</a>, the goal of which was to enable
technology to
take raw speech-to-text (STT) output and refine it into forms of more
use to
humans and to downstream automatic processes. In simple terms, this
means the
creation of automatic transcripts that are maximally readable. <br>
<br>
<i>Structural Metadata Annotation</i>: The Fisher data was carefully
transcribed by LDC staff using <a
href="http://projects.ldc.upenn.edu/Transcription/rt-04/RT-04-guidelines-V3.1.pdf">RT-04
Transcription Specification, Version 3.1</a>; for the Switchboard data,
transcripts developed at the Institute for Signal and Information
Processing at
Mississippi State University were used. The
transcribed data was annotated according to <a
href="http://projects.ldc.upenn.edu/MDE/Guidelines/SimpleMDE_V6.2.pdf">SimpleMDE
V6.2</a>, an annotation task defined by LDC that consisted of the
following
elements: Edit Disfluencies (repetitions, revisions, restarts and
complex
disfluencies), Fillers (including, e.g., filled pauses and discourse
markers)
and SUs, or syntactic/semantic units.</p>
<p><i>Parsing and Treebank Annotation</i>: The existing MDE
annotations
were converted from RTTM format into a format appropriate for the
automatic
parser, enabling the generation of accurate parses in a form that would
require
as little hand modification by the Treebank team as possible. RTTM is a
format
developed by NIST (National Institute of Standards and Technology) for
the
EARS program that labels each token in the reference transcript
according to
the properties it displays (e.g., lexeme versus non-lexeme, edit,
filler, SU).
The initial parse trees were produced using <a
href="http://www.ldc.upenn.edu/Catalog/docs/LDC2000T43/parser.pdf">an
entropy-based parser</a>. These parses served as the starting point
for a
manual process that corrected the initial pass for each conversation.</p>
<p>To provide high-quality parses, scripts used the MDE annotations to
separate the edited material from the fluent part of each SU before
parsing. The edits were then parsed and reinserted into the tree
for
presentation to the annotators. Manual treebank annotation was
performed in
accordance with existing treebank guidelines for conversational
telephone
speech as well as in accordance with revised general guidelines for
treebanking.</p>
<p style="text-align: center;" align="center"><b>*</b></p>
<p>(3) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T02">GALE
Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1</a>
contains
transcripts and English translations of 20.4 hours of Chinese broadcast
conversation programming from China Central TV (CCTV) and Phoenix TV.
It does
not contain the audio files from which the transcripts and translations
were
generated. GALE Phase 1 Chinese Broadcast Conversation Parallel Text -
Part 1,
along with other corpora, was used as training data in year 1 (Phase 1)
of the
DARPA-funded GALE program.</p>
<p>A total of 20.4 hours of Chinese broadcast conversation programming
were
selected from two sources: CCTV (a broadcaster from Mainland China),
and Phoenix TV (a Hong Kong-based satellite TV
station). The transcripts and translations represent recordings of
eight
different programs. A manual selection procedure was used to choose
data
appropriate for the GALE program, namely, conversation (talk) programs
focusing
on current events. Stories on topics such as sports, entertainment and
business
were excluded from the data set.</p>
<p>The selected audio snippets were carefully transcribed by LDC
annotators and
professional transcription agencies following LDC's Quick Rich
Transcription
specification. Manual sentence unit/segment (SU) annotation was also
performed as part of the transcription task. Three types of
end-of-sentence SU
were identified: statement SU, question SU, and incomplete SU.</p>
<p>After transcription and SU annotation, files were reformatted into a
human-readable translation format and assigned to professional
translators for
careful translation. Translators followed LDC's GALE Translation
guidelines,
which describe the makeup of the translation team, the source data
format, the
translation data format, best practices for translating certain
linguistic
features (such as names and speech disfluencies) and quality control
procedures
applied to completed translations.<br>
</p>
<div class="MsoNormal" style="text-align: center;" align="center">
<hr align="center" size="2" width="100%"></div>
<br>
<div align="center"><font face="Courier New, Courier, monospace"><small><small><big>Ilya
Ahtaridis<br>
Membership Coordinator</big><br>
<br>
</small>--------------------------------------------------------------------</small><small><br>
</small></font></div>
<div align="center">
<pre class="moz-signature" cols="72"><font
face="Courier New, Courier, monospace">Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 <a
class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a
class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>
</div>
</body>
</html>