<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

</head>

<body bgcolor="#ffffff" text="#000000">

<div align="center"><b>-  Collaboration between LDC and Georgetown

University Press  -<br>

<br>

</b><b>LDC2008S06</b><br>

<b>-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S06">CSLU:

Alphadigit Version 1.3</a>  -</b><br>

<br>

<b>LDC2008T08</b><br>

<b>-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T08">GALE

Phase 1 Chinese Broadcast News Parallel Text - Part 2</a>  -</b><br>

<br>

<b>The Linguistic Data Consortium (LDC) would

like to report on recent developments and announce the availability of

two new publications.<br>

<br>

</b>

<hr size="2" width="100%"><b><br>

Collaboration between LDC and Georgetown University Press<br>

</b>

<br>

<p class="MsoNormal" style="" align="left">LDC

is pleased to announce that the <a href="http://www.ed.gov/index.jhtml">U.S.

Department of Education</a>, <a

 href="http://www.ed.gov/about/offices/list/ope/iegps/index.html">International

Education Programs Service</a>, has funded a collaboration between LDC

and <a href="http://www.press.georgetown.edu/">Georgetown University

Press</a> (GUP)

to create up-to-date lexical databases, with translations to and from

English,

for three dialects of colloquial Arabic. The databases will be used for
interactive computer access and for new print publications of
dictionaries in the
Iraqi, Syrian/Levantine and Moroccan dialects.</p>

<div align="left"></div>

<p class="MsoNormal" style="" align="left">The

databases will be based on three GUP source dictionaries: <i>A

Dictionary of Iraqi

Arabic, English-Arabic, Arabic-English</i> (Clarity et al., 2003), <i>A

Dictionary of Syrian Arabic, English-Arabic</i> (Stowasser and Ani,

2004) and
<i>A Dictionary of Moroccan Arabic, Arabic-English, English-Arabic</i>

(Harrell

and Sobelman, 2004). Drawing on contemporary principles of computational
linguistics and current pedagogical requirements, the work will reflect
current
vocabulary and usage, provide a standardized system of
transcription, and use the Arabic script, both vocalized and
unvocalized, to
show vowel pronunciation as well as standard orthography. A searchable

version

on CD-ROM will accompany each print reference. The project has been

funded for

three years. Work will commence in Year 1 with the Iraqi Arabic

dictionary,

proceed to the Syrian/Levantine dictionary and conclude with the

Moroccan

Arabic dictionary. <o:p></o:p></p>

<div align="left"></div>

<p class="MsoNormal" style="margin-bottom: 12pt;" align="left">The

proposed dictionaries and databases aim to provide U.S.

students and teachers of Arabic with current dialectal Arabic lexical

information to enable them to communicate orally with native and

non-native

Arabic speakers. The scholarship used to create a modernized

transcription

system and to provide existing and new terms in Arabic script

(including

diacritics) may also help integrate instruction in dialect and Modern

Standard

Arabic by providing tools for curriculum developers.<br>

<br>

</p>

<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"

 align="center"><b>New Publications<br>

</b></p>

<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"

 align="center"><br>

<o:p></o:p></p>

<p align="left">(1) <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S06">CSLU: 

Alphadigit Version 1.3</a> is a collection of
78,044 utterances from 3,025 speakers saying six-character strings of
letters and digits over the telephone, for a total of approximately 82
hours of speech. Each speech file has corresponding orthographic and
phonemic transcriptions. This corpus was created by the Center for
Spoken Language Understanding (CSLU), Oregon Health &amp; Science

University, Beaverton, Oregon.</p>

<p align="left">Participants received a list of 18-29 six-character strings
(e.g., "a 2
b 4 5 g"); 1,102 different strings were used over the course of
the data collection. The lists were designed to balance phonetic
context across all letter and digit pairs. The data were recorded
directly from a digital phone line, without
digital-to-analog or analog-to-digital conversion at the recording end,
using the CSLU T1 digital data collection system. The sampling rate was
8 kHz and the files were stored in 8-bit mu-law format on a UNIX file

system. The files have been converted to RIFF standard file format,

16-bit linearly encoded.</p>
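<p align="left">As a rough illustration of the kind of conversion described
above, the following Python sketch expands 8-bit mu-law codes into 16-bit
linear samples using the standard G.711 expansion. It is not the tool that
CSLU or LDC used; the function names <code>mulaw_byte_to_linear</code> and
<code>mulaw_to_pcm16</code> are hypothetical, and the sketch produces raw
sample bytes only, without a RIFF header.</p>

<pre>
```python
BIAS = 132  # G.711 mu-law bias (0x84)


def mulaw_byte_to_linear(code):
    """Expand one 8-bit mu-law code (0-255) to a signed linear sample."""
    inverted = 255 - code            # mu-law codes are stored bit-inverted
    negative = inverted // 128 == 1  # top bit carries the sign
    exponent = (inverted // 16) % 8  # next three bits select the segment
    mantissa = inverted % 16         # low four bits give the step in the segment
    magnitude = (mantissa * 8 + BIAS) * (2 ** exponent) - BIAS
    return -magnitude if negative else magnitude


def mulaw_to_pcm16(data):
    """Convert mu-law bytes to little-endian 16-bit linear PCM bytes."""
    samples = [mulaw_byte_to_linear(b) for b in data]
    return b"".join(s.to_bytes(2, "little", signed=True) for s in samples)
```
</pre>

<p align="left">Under these assumptions, the mu-law code 0xFF decodes to a
sample value of 0 and the code 0x00 decodes to -32124, so every decoded
value fits in a signed 16-bit sample.</p>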

<p style="margin-bottom: 12pt;" align="left">All of the files included

in this corpus have corresponding

non-time-aligned word-level transcriptions and time-aligned

phoneme-level transcriptions (automatic forced alignment) that comply

with the conventions in the CSLU Labeling Guide.   <o:p></o:p></p>

<p style="margin-bottom: 12pt; text-align: center;" align="left"><b>*</b><o:p></o:p></p>

<p class="MsoNormal" align="left">(2) <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T08">GALE

Phase 1 Chinese Broadcast News Parallel Text - Part 2</a> contains

transcripts

and English translations of 22.9 hours of Chinese broadcast news

programming

from China Central TV (CCTV) and Phoenix TV. It does not contain the

audio

files from which the transcripts and translations were generated. GALE

Phase 1

Chinese Broadcast News Parallel Text - Part 2 is the second part of the
three-part
GALE Phase 1 Chinese Broadcast News Parallel Text, which, along with
other
corpora, was used as training data in Year 1 (Phase 1) of the
DARPA-funded GALE
program.</p>

<p align="left">A total of 22.9 hours of Chinese broadcast news
recordings were
selected
from two sources, CCTV (a broadcaster from mainland China)
and Phoenix TV (a Hong Kong-based satellite TV
station).
The transcripts and translations represent recordings of five different
programs.</p>

<p align="left">A manual selection procedure was used to choose data

appropriate for

the

GALE program, namely, news programs focusing on current events. Stories

on

topics such as sports, entertainment and stock markets were excluded

from the

data set. Manual sentence unit/segment (SU) annotation was also
performed on a subset of files following LDC's Quick Rich Transcription
specification. Three types of end-of-sentence SU were identified:
statement SU,
question SU, and incomplete SU. After transcription and SU annotation,
the files
were reformatted into a human-readable translation format and
then assigned to professional translators for careful translation.
Translators
followed LDC's GALE Translation guidelines, which describe the makeup
of the
translation team, the source data format, the translation data format,
best

practices for translating certain linguistic features (such as names

and speech

disfluencies), and quality control procedures applied to completed

translations.  <br>

</p>

<br>

<hr size="2" width="100%">

<div align="center"><font face="Courier New, Courier, monospace"><small><small><big><br>

Ilya

Ahtaridis<br>

Membership Coordinator</big><br>

<br>

</small>--------------------------------------------------------------------</small><small><br>

</small></font></div>

<div align="center">

<pre class="moz-signature" cols="72"><font

 face="Courier New, Courier, monospace">Linguistic Data Consortium                     Phone: (215) 573-1275

University of Pennsylvania                       Fax: (215) 573-2175

3600 Market St., Suite 810                         <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

 Philadelphia, PA 19104 USA                   <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>

</div>

<br>

</div>

<b><br>

</b>

</body>

</html>