<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
<title></title>
</head>
<body bgcolor="#ffffff" text="#000000">
<div align="center"><font face="Courier New"><small>LDC2005S08 <br>
</small></font><font face="Courier New"><small><b>BBN/AUB DARPA Babylon
Levantine
Arabic Speech and Transcripts </b></small></font><br>
<br>
<font face="Courier New"><small>LDC2005T01
</small></font><br>
<font face="Courier New"><small><b>Chinese Treebank 5.0</b>
</small></font><br>
<br>
<font face="Courier New"><small>LDC2005S07
</small></font><br>
<font face="Courier New"><small><b>Levantine Arabic QT Training Data
Set 3 Speech</b>
</small></font><br>
<br>
<font face="Courier New"><small>LDC2005T03
</small></font><br>
<font face="Courier New"><small><b>Levantine Arabic QT Training Data
Set 3 Transcripts</b>
<br>
<br>
<br>
The Linguistic Data Consortium (LDC) would like to announce the
availability of four new corpora.<br>
<br>
</small></font>
<hr size="2" width="100%"><font face="Courier New"><small><br>
</small></font></div>
<font face="Courier New"><small><br>
</small></font><font face="Courier New"><small>(1) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S08">BBN/AUB
DARPA Babylon Levantine Arabic Speech and Transcripts</a> consists
of transcribed, spontaneous speech, recorded from subjects speaking in
Levantine colloquial Arabic. Levantine Arabic is the dialect of Arabic
spoken by ordinary people in Lebanon, Jordan, Syria, and Palestine. It
differs significantly from Modern Standard Arabic (MSA) in that it is a
spoken rather than a written language, with different word
pronunciations and even different words.
<br>
<br>
The corpus is intended for
anyone working on speech recognition in Levantine colloquial
Arabic, including for speech translation and spoken dialog systems.
BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts is
distributed on two DVD-ROMs.
<br>
<br>
<br>
<br>
(2) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T01">Chinese
Treebank 5.0</a> is a 500K-word corpus of Chinese text with
syntactic bracketing. The corpus contains 824K Hanzi, 18K sentences,
and 890 data
files. The data is drawn from three sources: Xinhua (1994-1998),
Information Services Department of HKSAR (1997), and Sinorama magazine,
Taiwan (1996-1998 &amp; 2000-2001).
<br>
<br>
All files are GB-encoded. Chinese Treebank 5.0 provides four versions
of files: bracketed, raw, segmented, and POS-tagged. The raw, segmented,
and POS-tagged versions are generated from the bracketed version and so
do not reflect the previous annotation stages. Chinese Treebank 5.0 is
distributed on one CD-ROM.
<br>
<br>
<br>
<br>
(3) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S07">Levantine
Arabic QT Training Data Set 3 Speech</a> contains 322
telephone conversations and totals about 50 hours of Levantine Arabic
speech.
Participants were instructed to speak on set topics. Unlike the
previous training data corpora (Sets 1 and 2),
whose speakers are nearly 100% Jordanian, this corpus is mostly
Lebanese (72%), plus a mix of other Levantine speakers.
Levantine Arabic QT Training Data Set 3 Speech is distributed on one
DVD-ROM.
<br>
<br>
<br>
<br>
(4) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T03">Levantine
Arabic QT Training Data Set 3 Transcripts</a> contains the
transcription for the Levantine Arabic QT Training Data Set 3. There
are 322 files in UTF-8 format. The corpus also contains a word list and
speaker information files. Levantine Arabic QT Training Data Set 3
Transcripts is distributed on
one CD-ROM.<br>
<br>
<br>
</small></font>
<hr size="2" width="100%"><font face="Courier New"><small><br>
</small></font>
<div align="center"><font face="Courier New"><small>If you need further
information or would like to inquire about
membership in the LDC, please email <a class="moz-txt-link-abbreviated"
href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a> or call +1 215
573 1275.<br>
<br>
<br>
</small></font></div>
<div align="center">--------------------------------------------------------------------<br>
</div>
<div align="center">
<pre class="moz-signature" cols="72">Linguistic Data Consortium Phone: (215) 573-1275
3600 Market Street Fax: (215) 573-2175
Suite 810 <a
class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 <a
class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></pre>
</div>
<br>
</body>
</html>