<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

  <title></title>

</head>

<body bgcolor="#ffffff" text="#000000">

<div align="center"><font face="Courier New"><small>LDC2005S08 <br>

</small></font><font face="Courier New"><small><b>BBN/AUB DARPA Babylon

Levantine

Arabic Speech and Transcripts </b></small></font><br>

<br>

<font face="Courier New"><small>LDC2005T01

</small></font><br>

<font face="Courier New"><small><b>Chinese Treebank 5.0</b>

</small></font><br>

<br>

<font face="Courier New"><small>LDC2005S07

</small></font><br>

<font face="Courier New"><small><b>Levantine Arabic QT Training Data

Set 3 Speech</b>

</small></font><br>

<br>

<font face="Courier New"><small>LDC2005T03

</small></font><br>

<font face="Courier New"><small><b>Levantine Arabic QT Training Data

Set 3 Transcripts</b>

<br>

<br>

<br>

The Linguistic Data Consortium (LDC) would like to announce the

availability of four new corpora.<br>

<br>

</small></font>

<hr size="2" width="100%"><font face="Courier New"><small><br>

</small></font></div>

<font face="Courier New"><small><br>

</small></font><font face="Courier New"><small>(1)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S08">BBN/AUB

DARPA Babylon Levantine Arabic Speech and Transcripts</a> consists

of transcribed, spontaneous speech, recorded from subjects speaking in

Levantine colloquial Arabic. Levantine Arabic is the dialect of Arabic

spoken by ordinary people in Lebanon, Jordan, Syria, and Palestine. It

differs significantly from Modern Standard Arabic (MSA) in that

it is a spoken rather than a written language, with different

word pronunciations and even different words.

<br>

<br>

The corpus should be useful to

anyone working on speech recognition in Levantine colloquial

Arabic, including for speech translation and spoken dialog systems.

BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts is

distributed on two DVD-ROMs.

<br>

<br>

<br>

<br>

(2)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T01">Chinese

Treebank 5.0</a> is a 500K-word corpus of Chinese text with

syntactic bracketing. The corpus contains 824K Hanzi, 18K sentences,

and 890 data

files. The data is drawn from three sources: Xinhua (1994-1998),

Information Services Department of HKSAR (1997), and Sinorama magazine,

Taiwan (1996-1998 &amp; 2000-2001).

<br>

<br>

All files are GB encoded. Chinese Treebank 5.0 provides four versions

of files: bracketed, raw, segmented and POS tagged. The raw, segmented

and POS tagged versions are generated from the bracketed version and so

do not reflect the previous annotation stages. Chinese Treebank 5.0 is

distributed on one CD-ROM.

 <br>

<br>

<br>

<br>

(3)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S07">Levantine

Arabic QT Training Data Set 3 Speech</a> contains 322

telephone conversations and totals about 50 hours of Levantine Arabic

speech.

Participants were instructed to speak on set topics.  Unlike the

previous training data corpora (Sets 1 and 2),

whose speakers are nearly 100% Jordanian, the speakers in this corpus are mostly

Lebanese (72%), with a mix of other Levantine speakers. 

Levantine Arabic QT Training Data Set 3 Speech is distributed on one

DVD-ROM.

<br>

<br>

<br>

<br>

(4)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T03">Levantine

Arabic QT Training Data Set 3 Transcripts</a> contains the

transcription for the Levantine Arabic QT Training Data Set 3.  There

are 322 files in UTF-8 format. The corpus also contains a word list and

speaker information files.  Levantine Arabic QT Training Data Set 3

Transcripts is distributed on

one CD-ROM.<br>

<br>

<br>

</small></font>

<hr size="2" width="100%"><font face="Courier New"><small><br>

</small></font>

<div align="center"><font face="Courier New"><small>If you need further

information, or would like to inquire about

membership in the LDC, please email <a class="moz-txt-link-abbreviated"

 href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a> or call +1 215

573 1275.<br>

<br>

<br>

</small></font></div>

<div align="center">--------------------------------------------------------------------<br>

</div>

<div align="center">

<pre class="moz-signature" cols="72">Linguistic Data Consortium                     Phone: (215) 573-1275

3600 Market Street                             Fax:   (215) 573-2175

Suite 810                                          <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104                      <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></pre>

</div>

<br>


</body>

</html>