<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

</head>

<body bgcolor="#ffffff" text="#000000">

<p align="center">LDC2009S01<br>

<big><font face="Times New Roman, Times, serif" size="2"><big>-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S01"><b>CSLU:

Numbers Version 1.3</b></a>  -<br>

</big></font></big></p>

<p align="center"><big><font face="Times New Roman, Times, serif"

 size="2"><big> LDC2009T01</big></font></big><br>

<b>-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T01">English

CTS Treebank with Structural Metadata</a>  -<br>

</b></p>

<p align="center">LDC2009T02<br>

<b>-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T02">GALE

Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1</a>  -<br>

</b></p>

<p align="center">The Linguistic Data

Consortium (LDC) would like to

announce the availability of three new publications.<b><br>

</b></p>

<hr size="2" width="100%">

<p style="margin-bottom: 12pt; text-align: center;" align="center"><b>New

Publications<br>

</b></p>

<p>(1) <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S01">CSLU:

Numbers Version 1.3</a> was created by the Center for Spoken Language

Understanding (CSLU) at OGI School of Science and Engineering, Oregon

Health

and Science University,

Beaverton, Oregon.

It is a collection of naturally produced numbers taken from utterances

in

various CSLU telephone speech data collections. The corpus consists of

approximately fifteen hours of speech and includes isolated digit

strings,

continuous digit strings, and ordinal/cardinal numbers.</p>

<p>The numbers come from several sources, among them phone numbers,

street addresses, and zip codes, and were uttered by 12,618 speakers in a total of

23,902

files. In most of CSLU's telephone data collections, callers were asked

for

their phone number, date of birth, or zip code. Callers would also

occasionally

leave numbers in the midst of another utterance. The numbers in those

situations were extracted from the host utterance and added to the

corpus.</p>

<p>Each file includes an orthographic transcription following the CSLU

Labeling

guidelines, which are included in the documentation for this

publication. Many

of the utterances have also been phonetically labeled.</p>

<p class="MsoNormal" style="text-align: center;" align="center">*</p>

<p class="MsoNormal"><br>

(2) <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T01">English

CTS Treebank with Structural Metadata</a> consists of metadata and

syntactic

structure annotations for 144 English telephone conversations, or

140,000

words, from data used in the <a

 href="http://projects.ldc.upenn.edu/EARS/">EARS

(Effective, Affordable, Reusable Speech-to-Text) program</a>. English

CTS

Treebank with Structural Metadata was created to support EARS work in

English.

It applies EARS metadata extraction annotations and Penn Treebank

methods to

conversations from <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S62">Switchboard-1

Release 2 (LDC97S62)</a> and from data collected for EARS under the

Fisher

Protocol.<br>

<br>

The purpose of the EARS program was to develop robust speech

recognition

technology to address a range of languages and speaking styles. LDC

provided conversational

and broadcast speech and transcripts, annotations, lexicons and texts

for

language modeling in each of the EARS languages (Arabic, Chinese,

English). LDC

also supported a <a href="http://projects.ldc.upenn.edu/MDE">metadata

extraction

(MDE) research evaluation</a>, the goal of which was to enable

technology to

take raw speech-to-text (STT) output and refine it into forms of more

use to

humans and to downstream automatic processes. In simple terms, this

means the

creation of automatic transcripts that are maximally readable. <br>

<br>

<i>Structural Metadata Annotation</i>:  The Fisher data was carefully

transcribed by LDC staff using <a

 href="http://projects.ldc.upenn.edu/Transcription/rt-04/RT-04-guidelines-V3.1.pdf">RT-04

Transcription Specification, Version 3.1</a>; for the Switchboard data,

transcripts developed at the Institute for Signal and Information

Processing at

Mississippi State

University were used. The

transcribed data was annotated according to <a

 href="http://projects.ldc.upenn.edu/MDE/Guidelines/SimpleMDE_V6.2.pdf">SimpleMDE

V6.2</a>, an annotation task defined by LDC that consisted of the

following

elements: Edit Disfluencies (repetitions, revisions, restarts and

complex

disfluencies), Fillers (e.g., filled pauses and discourse

markers)

and SUs, or syntactic/semantic units.</p>

<p><i>Parsing and Treebank Annotation</i>:  The existing MDE

annotations

were converted from RTTM format into a format appropriate for the

automatic

parser, enabling the generation of accurate parses in a form that would

require

as little hand modification by the Treebank team as possible. RTTM is a

format

developed by NIST (National Institute of Standards and Technology) for

the

EARS program that labels each token in the reference transcript

according to

the properties it displays (e.g., lexeme versus non-lexeme, edit,

filler, SU).

The initial parse trees were produced using <a

 href="http://www.ldc.upenn.edu/Catalog/docs/LDC2000T43/parser.pdf">an

entropy-based parser</a>.  These parses served as the starting point

for a

manual process which corrected the initial pass for each conversation.</p>

<p>To provide high-quality parses, scripts used the MDE annotations to

separate the edited material from the fluent part of each SU prior to

parsing. The edits were then parsed and reinserted into the tree

for

presentation to the annotators. Manual treebank annotation was

performed in

accordance with existing treebank guidelines for conversational

telephone

speech as well as in accordance with revised general guidelines for

treebanking.</p>

<p style="text-align: center;" align="center"><b>*</b></p>

<p>(3)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T02">GALE

Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1</a>

contains

transcripts and English translations of 20.4 hours of Chinese broadcast

conversation programming from China Central TV (CCTV) and Phoenix TV.

It does

not contain the audio files from which the transcripts and translations

were

generated. GALE Phase 1 Chinese Broadcast Conversation Parallel Text -

Part 1,

along with other corpora, was used as training data in year 1 (Phase 1)

of the

DARPA-funded GALE program.</p>

<p>A total of 20.4 hours of Chinese broadcast conversation programming

were

selected from two sources: CCTV (a broadcaster from Mainland China)

and Phoenix TV (a Hong Kong-based satellite TV

station). The transcripts and translations represent recordings of

eight

different programs.  A manual selection procedure was used to choose

data

appropriate for the GALE program, namely, conversation (talk) programs

focusing

on current events. Stories on topics such as sports, entertainment and

business

were excluded from the data set.</p>

<p>The selected audio snippets were carefully transcribed by LDC

annotators and

professional transcription agencies following LDC's Quick Rich

Transcription

specification. Manual sentence unit/segment (SU) annotation was also

performed as part of the transcription task. Three types of end-of-sentence

SU

were identified: statement SU, question SU, and incomplete SU.</p>

<p>After transcription and SU annotation, files were reformatted into a

human-readable translation format and assigned to professional

translators for

careful translation. Translators followed LDC's GALE Translation

guidelines,

which describe the makeup of the translation team, the source data

format, the

translation data format, best practices for translating certain

linguistic

features (such as names and speech disfluencies) and quality control

procedures

applied to completed translations.<br>

</p>

<div class="MsoNormal" style="text-align: center;" align="center">

<hr align="center" size="2" width="100%"></div>

<br>

<div align="center"><font face="Courier New, Courier, monospace"><small><small><big>Ilya

Ahtaridis<br>

Membership Coordinator</big><br>

<br>

</small>--------------------------------------------------------------------</small><small><br>

</small></font></div>

<div align="center">

<pre class="moz-signature" cols="72"><font

 face="Courier New, Courier, monospace">Linguistic Data Consortium                     Phone: (215) 573-1275

University of Pennsylvania                       Fax: (215) 573-2175

3600 Market St., Suite 810                         <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

 Philadelphia, PA 19104 USA                   <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>

</div>

</body>

</html>