<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
</head>
<body bgcolor="#ffffff" text="#000000">
<div align="center"><b>- Collaboration between LDC and Georgetown
University Press -<br>
<br>
</b><b>LDC2008S06</b><br>
<b>- <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S06">CSLU:
Alphadigit Version 1.3</a> -</b><br>
<br>
<b>LDC2008T08</b><br>
<b>- <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T08">GALE
Phase 1 Chinese Broadcast News Parallel Text - Part 2</a> -</b><br>
<br>
<b>The Linguistic Data Consortium (LDC) would
like to report on recent developments and announce the availability of
two new publications.<br>
<br>
</b>
<hr size="2" width="100%"><b><br>
Collaboration between LDC and Georgetown University
Press<br>
</b>
<br>
<p class="MsoNormal" style="" align="left">LDC
is pleased to announce that the <a href="http://www.ed.gov/index.jhtml">U.S.
Department of Education</a>, <a
href="http://www.ed.gov/about/offices/list/ope/iegps/index.html">International
Education Programs Service</a>, has funded a collaboration between LDC
and <a href="http://www.press.georgetown.edu/">Georgetown University
Press</a> (GUP)
to create up-to-date lexical databases, with translations to and from
English,
for three dialects of colloquial Arabic. The databases will be used for
interactive computer access and for new print publications of
dictionaries in
Iraqi, Syrian/Levantine and Moroccan dialects.</p>
<p class="MsoNormal" style="" align="left">The
databases will be based on three GUP source dictionaries: <i>A
Dictionary of Iraqi
Arabic, English-Arabic, Arabic-English</i> (Clarity et al., 2003), <i>A
Dictionary of Syrian Arabic, English-Arabic</i> (Stowasser and Ani,
2004) and
<i>A Dictionary of Moroccan Arabic, Arabic-English, English-Arabic</i>
(Harrell
and Sobelman, 2004). Drawing on contemporary principles of computational
linguistics and current pedagogical requirements, the work will reflect
current vocabulary and usage, provide a standardized system of
transcription, and use the Arabic script, both vocalized and
unvocalized, to
show vowel pronunciation as well as standard orthography. A searchable
version
on CD-ROM will accompany each print reference. The project has been
funded for
three years. Work will commence in Year 1 with the Iraqi Arabic
dictionary,
proceed to the Syrian/Levantine dictionary and conclude with the
Moroccan
Arabic dictionary.</p>
<p class="MsoNormal" style="margin-bottom: 12pt;" align="left">The
proposed dictionaries and databases aim to provide U.S.
students and teachers of Arabic with current dialectal Arabic lexical
information to enable them to communicate orally with native and
non-native
Arabic speakers. The scholarship used to create a modernized
transcription
system and to provide existing and new terms in Arabic script
(including
diacritics) may also help integrate instruction in dialect and Modern
Standard
Arabic by providing tools for curriculum developers.<br>
<br>
</p>
<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"
align="center"><b>New Publications<br>
</b></p>
<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"
align="center"><br>
</p>
<p align="left">(1) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S06">CSLU:
Alphadigit Version 1.3</a> is a collection of
78,044 utterances from 3,025 speakers reading six-character strings of
letters and digits over the telephone for a total of approximately 82
hours of speech. Each speech file has corresponding orthographic and
phonemic transcriptions. This corpus was created by the Center for
Spoken Language Understanding (CSLU), Oregon Health &amp; Science
University, Beaverton, Oregon.</p>
<p align="left">Participants received a list of 18 to 29 six-character strings
(e.g., "a 2
b 4 5 g"); 1,102 different strings were used over the course of
the data collection. The lists were set up to balance for phonetic
context between all letter and digit pairs. The data were recorded
directly from a digital phone line without
digital-to-analog or analog-to-digital conversion at the recording end
using the CSLU T1 digital data collection system. The sampling rate was
8 kHz and the files were stored in 8-bit mu-law format on a UNIX file
system. The files have since been converted to RIFF standard file format
with 16-bit linear encoding.</p>
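<p align="left">For readers unfamiliar with the two encodings, the
following is a minimal sketch of how 8-bit mu-law samples can be
expanded to 16-bit linear samples and wrapped in a RIFF container. It
uses the widely known G.711-style mu-law expansion and Python's
standard <tt>wave</tt> module; it is not the procedure CSLU or LDC
actually used, which is not described here, and the function names
<tt>mulaw_to_linear</tt> and <tt>mulaw_to_wav_bytes</tt> are
hypothetical.</p>
<div align="left">
<pre>
```python
import io
import wave


def mulaw_to_linear(byte):
    """Decode one 8-bit mu-law byte (G.711 style) to a signed linear sample."""
    u = 255 - byte            # mu-law bytes are stored bit-inverted
    sign = u // 128           # top bit: 1 means a negative sample
    exponent = (u // 16) % 8  # next three bits: segment number
    mantissa = u % 16         # low four bits: step within the segment
    magnitude = (8 * mantissa + 132) * 2 ** exponent - 132
    return -magnitude if sign else magnitude


def mulaw_to_wav_bytes(mulaw_data, sample_rate=8000):
    """Return the bytes of a mono, 16-bit linear RIFF file (assumed 8 kHz)."""
    pcm = b"".join(
        mulaw_to_linear(b).to_bytes(2, "little", signed=True)
        for b in mulaw_data
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)            # one channel (telephone speech)
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


if __name__ == "__main__":
    # Three illustrative mu-law bytes: silence and the two extremes.
    riff = mulaw_to_wav_bytes(bytes([0xFF, 0x00, 0x80]))
    print(riff[:4], len(riff))
```
</pre>
</div>
<p align="left">Because mu-law companding is logarithmic, each 8-bit
code expands to a value that fits in a signed 16-bit sample, which is
why the converted files use a wider linear encoding.</p>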
<p style="margin-bottom: 12pt;" align="left">All of the files included
in this corpus have corresponding
non-time-aligned word-level transcriptions and time-aligned
phoneme-level transcriptions (produced by automatic forced alignment) that comply
with the conventions in the CSLU Labeling Guide.</p>
<p style="margin-bottom: 12pt; text-align: center;" align="left"><b>*</b></p>
<p class="MsoNormal" align="left">(2) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T08">GALE
Phase 1 Chinese Broadcast News Parallel Text - Part 2</a> contains
transcripts
and English translations of 22.9 hours of Chinese broadcast news
programming
from China Central TV (CCTV) and Phoenix TV. It does not contain the
audio
files from which the transcripts and translations were generated. GALE
Phase 1
Chinese Broadcast News Parallel Text - Part 2 is the second part of the
three-part
GALE Phase 1 Chinese Broadcast News Parallel Text corpus, which, along with
other
corpora, was used as training data in Year 1 (Phase 1) of the
DARPA-funded GALE
program.</p>
<p align="left">The 22.9 hours of Chinese broadcast news
recordings were selected
from two sources: CCTV (a broadcaster from mainland China)
and Phoenix TV (a Hong Kong-based satellite TV station).
The transcripts and translations represent recordings of five different
programs.</p>
<p align="left">A manual selection procedure was used to choose data
appropriate for
the
GALE program, namely, news programs focusing on current events. Stories
on
topics such as sports, entertainment and stock markets were excluded
from the
data set. Manual sentence unit/segment (SU) annotation was also
performed on a subset of files following LDC's Quick Rich Transcription
specification. Three types of end-of-sentence SU were identified:
statement SU,
question SU, and incomplete SU. After transcription and SU annotation,
the files were reformatted into a human-readable translation format and
then assigned to professional translators for careful translation.
Translators
followed LDC's GALE Translation guidelines, which describe the makeup
of the
translation team, the source data format, the translation data format,
best
practices for translating certain linguistic features (such as names
and speech
disfluencies), and quality control procedures applied to completed
translations.<br>
</p>
<br>
<hr size="2" width="100%">
<div align="center"><font face="Courier New, Courier, monospace"><small><small><big><br>
Ilya
Ahtaridis<br>
Membership Coordinator</big><br>
<br>
</small>--------------------------------------------------------------------</small><small><br>
</small></font></div>
<div align="center">
<pre class="moz-signature" cols="72"><font
face="Courier New, Courier, monospace">Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 <a
class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a
class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>
</div>
<br>
</div>
</body>
</html>