<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

</head>

<body bgcolor="#ffffff" text="#000000">

<div class="moz-text-html" lang="x-western">

<p align="center">LDC2008S02<b><br>

-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S02">CSLU:

National Cellular Telephone Speech Release 2.3</a>  -<br>

</b></p>

<p align="center">LDC2008T02<b><br>

</b><b>-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T02">GALE

Phase 1 Arabic Blog Parallel Text</a>  -<br>

</b></p>

<p align="center">LDC2008S03<b><br>

-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S03">STC-TIMIT

1.0</a>  -<br>

<br>

</b></p>

<p align="center">The Linguistic Data

Consortium (LDC)

would

like to announce the availability of three new publications.<br>

<br>

</p>

<hr size="2" width="100%">

<div align="center"><br>

<b>New Publications</b><br>

</div>

<br>

<p>(1)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S02">CSLU:

National Cellular Telephone Speech Release 2.3</a> was created by

the Center for Spoken Language Understanding (CSLU) at OGI School of

Science and Engineering, Oregon Health and Science University,

Beaverton, Oregon. It consists of cellular telephone speech and

corresponding transcripts, specifically, approximately one minute of

speech from 2336 speakers calling from locations throughout the United

States.  </p>

<p>Speakers called the CSLU data collection system on cellular

telephones, and they were asked a series of questions. Two prompt

protocols were used: an In Vehicle Protocol for speakers calling from

inside a vehicle and a Not in Vehicle Protocol for those calling from

outside a vehicle. The protocols shared several questions, but each

protocol contained distinct queries designed to probe the conditions of

the caller's in-vehicle or not-in-vehicle surroundings. </p>

<p>The text transcriptions in this corpus were produced using the

non-time-aligned word-level conventions described in The CSLU Labeling

Guide, which is included in the documentation for this release. CSLU:

National Cellular Telephone Speech Release 2.3 contains orthographic

and phonetic transcriptions of corresponding speech files. <br>

</p>

<br>

<p align="center"><b>*</b><br>

</p>

<p>(2)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T02">GALE

Phase 1 Arabic Blog Parallel Text</a> was prepared by the LDC and

consists of 102K words (222 files) of Arabic blog text and its English

translation from thirty-three sources. This release was used as

training data in Phase 1 of the DARPA-funded GALE program. </p>

<p>The task of preparing this corpus involved four stages of work: data

scouting, data harvesting, formatting, and data selection.</p>

<p>Data scouting involved manually searching the web for suitable blog

text. Data scouts were assigned particular topics and genres along with

a production target in order to focus their web search. Formal

annotation guidelines and a customized annotation toolkit helped data

scouts to manage the search process and to track progress.</p>

<p>Data scouts logged their decisions about potential text of interest

(sites, threads and posts) to a database. A nightly process queried the

annotation database and harvested all designated URLs. Whenever

possible, the entire site was downloaded, not just the individual

thread or post located by the data scout. </p>
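<p>The nightly harvesting step can be pictured with a small Python sketch.
It is only an illustration under stated assumptions: the table name
<code>scout_decisions</code>, its columns, and the function name are
hypothetical and do not describe LDC's actual annotation database. The
sketch reads the designated URLs and collapses them to site roots,
reflecting the preference for downloading whole sites; the download
itself is left out.</p>

<pre>
```python
import sqlite3
from urllib.parse import urlsplit


def site_roots_to_harvest(conn):
    """Return the sorted, de-duplicated site roots of all designated URLs.

    Assumes a hypothetical table scout_decisions(url, designated).
    """
    rows = conn.execute(
        "SELECT url FROM scout_decisions WHERE designated = 1"
    ).fetchall()
    roots = set()
    for (url,) in rows:
        parts = urlsplit(url)
        # Keep only scheme and host so the whole site is targeted,
        # not just the thread or post the data scout logged.
        roots.add(f"{parts.scheme}://{parts.netloc}/")
    return sorted(roots)
```
</pre>

<p>Two posts logged from the same blog would yield a single site root, so
the site would be queued once per nightly run.</p>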

<p>Once the text was downloaded, its format was standardized so that

the data could be more easily integrated into downstream annotation

processes. Typically a new script was required for each new domain name

that was identified. After scripts were run, an optional manual process

corrected any remaining formatting problems.</p>

<p>The selected documents were then reviewed for content suitability

using a semi-automatic process. A statistical approach was used to rank

a document's relevance to a set of already-selected documents labeled

as "good." An annotator then reviewed the list of relevance-ranked

documents and selected those which were suitable for a particular

annotation task or for annotation in general. <br>

</p>
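<p>The announcement does not say which statistical approach was used for
ranking. As one possible illustration, the Python sketch below scores
each candidate by the cosine similarity between its word counts and the
pooled word counts of the documents already labeled "good." The
similarity measure, the whitespace tokenization, and all function names
are assumptions made for this example.</p>

<pre>
```python
import math
from collections import Counter


def term_counts(text):
    """Count lowercase whitespace-separated tokens (a deliberately crude tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter objects; 0.0 if either is empty."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_relevance(candidates, good_docs):
    """Return candidate names ordered from most to least similar to the good set.

    candidates: dict mapping a document name to its text.
    good_docs: list of texts already labeled "good".
    """
    profile = Counter()
    for doc in good_docs:
        profile.update(term_counts(doc))
    scored = [
        (cosine(term_counts(text), profile), name)
        for name, text in candidates.items()
    ]
    return [name for score, name in sorted(scored, reverse=True)]
```
</pre>

<p>An annotator would then work down the returned list from the top, which
matches the semi-automatic review described above.</p>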

<p>After files were selected, they were reformatted into a

human-readable translation format, and the files were then assigned to

professional translators for careful translation. Translators followed

LDC's GALE Translation guidelines, which describe the makeup of the

translation team, the source data format, the translation data format,

best practices for translating certain linguistic features (such as

names and speech disfluencies), and quality control procedures applied

to completed translations. </p>

<p>All final data are in Tab Delimited Format (TDF). TDF is compatible

with other transcription formats, such as the Transcriber format and AG

format, and it is easy to process.  Each line of a TDF file corresponds

to a speech segment and contains 13 tab-delimited fields. A source TDF

file and its translation are the same except that the transcript in the

source TDF is replaced by its English translation.  <br>

</p>
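<p>Because each TDF line is a fixed number of tab-separated fields, a
reader takes only a few lines of Python. The sketch below is a minimal
example, not an official parser: it checks the 13-field count stated
above and reports which columns differ between a source record and its
translation. The function names are invented for this example, no field
names or positions are assumed, and any header or comment lines a real
file might carry are not handled.</p>

<pre>
```python
EXPECTED_FIELDS = 13  # field count stated in the release description


def read_tdf_lines(lines):
    """Split each non-blank line on tabs and check the field count."""
    records = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != EXPECTED_FIELDS:
            raise ValueError(
                f"line {number}: expected {EXPECTED_FIELDS} fields, "
                f"found {len(fields)}"
            )
        records.append(fields)
    return records


def differing_columns(source_record, translation_record):
    """Return the zero-based indexes of the fields that differ between two records."""
    pairs = zip(source_record, translation_record)
    return [i for i, (s, t) in enumerate(pairs) if s != t]
```
</pre>

<p>If a source file and its translation match as described, comparing
paired records with <code>differing_columns</code> should report only the
column that holds the transcript.</p>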

<p align="center"><b>*</b><br>

</p>

<p>(3)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S03">STC-TIMIT

1.0</a> is a telephone version of <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1">TIMIT

Acoustic Phonetic Continuous Speech Corpus, LDC93S1</a> (TIMIT). TIMIT

contains broadband recordings of 630 speakers of eight major dialects

of American English, each reading ten phonetically rich sentences. Created in

1993, TIMIT was designed to provide speech data for acoustic-phonetic

studies and for the development and evaluation of automatic speech

recognition systems. In this TIMIT-derived corpus, the entire TIMIT

database was passed

through an actual telephone channel in a single call. Thus, a single

type of channel distortion and noise affects the whole database.</p>

<p>The process was managed using a Dialogic switchboard for the calling

and receiving ends. No transducer (microphone) was employed; the

original digital signal was converted to analog using the switchboard's

D/A converter, transmitted through a telephone channel, and converted

back to digital format before recording. As a result, the only

distortion introduced is that of the telephone channel itself. </p>

<p>The STC-TIMIT 1.0 database is organized in the same manner as the

original TIMIT corpus: 4620 files belonging to the training partition

and 1680 files belonging to the test partition. Utterances in STC-TIMIT

1.0 are time-aligned with those of TIMIT

with an average precision of 0.125 ms (1 sample), by maximizing the

cross-correlation between pairs of files from each corpus. Thus, labels

from TIMIT may be used for STC-TIMIT 1.0, and the effects of telephone

channels may be studied on a frame-by-frame basis.</p>

<p>Two telephone lines within the same building were connected to a

Dialogic(R) card. One of the lines was used as the calling-end and

played the speech file, while the other line was used as the

receiving-end and recorded the new signal. The whole recording process

was conducted in a single call. <br>

</p>

<p>After recording, the file was pre-cut according to the length of the

corresponding TIMIT database file. Each resulting file was then aligned

to its corresponding file in TIMIT using the xcorr routine in

Matlab(R). Based on these results, each file was cut again

from the original recording using the newly generated alignments.

Thus, each file in STC-TIMIT 1.0 is aligned to its equivalent in TIMIT

and has the same length.  <br>

</p>
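<p>The two-pass cutting procedure can be sketched in Python. The release
was aligned with the xcorr routine in Matlab(R); the example below swaps
in NumPy's <code>correlate</code> for illustration, and the function
names and the <code>rough_start</code> parameter are hypothetical. It
assumes the rough cut is off by only a few samples and that the refined
start index stays inside the recording.</p>

<pre>
```python
import numpy as np


def estimate_lag(reference, recorded):
    """Return the shift, in samples, of `recorded` relative to `reference`
    that maximizes their cross-correlation."""
    reference = np.asarray(reference, dtype=float)
    recorded = np.asarray(recorded, dtype=float)
    # In "full" mode, output index i corresponds to lag i - (len(reference) - 1).
    corr = np.correlate(recorded, reference, mode="full")
    return int(np.argmax(corr)) - (len(reference) - 1)


def cut_aligned(recording, reference, rough_start):
    """Cut a segment of `recording` that lines up with `reference`.

    First pass: pre-cut at `rough_start` using the reference length.
    Second pass: re-cut from the full recording at the corrected start.
    """
    n = len(reference)
    pre_cut = recording[rough_start:rough_start + n]
    lag = estimate_lag(reference, pre_cut)
    start = rough_start + lag
    return recording[start:start + n]
```
</pre>

<p>Cutting the second time from the full recording, rather than shifting
the pre-cut segment, is what keeps every output file the same length as
its TIMIT counterpart.</p>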

<hr size="2" width="100%"><br>

</div>

<pre class="moz-signature" cols="72">-- 

Ilya Ahtaridis

Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                  Phone: 1 (215) 573-1275

University of Pennsylvania                    Fax: 1 (215) 573-2175

3600 Market St., Suite 810                        <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104 USA                 <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>

</pre>

</body>

</html>