<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

<div class="moz-text-html" lang="x-western">

<div align="center">LDC2005T20

<br>

<a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T20">Arabic

Treebank: Part 3 (full corpus) v2.0 (MPG + Syntactic Analysis)</a>

<br>

<br>

LDC2005T10

<br>

<a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T10">Chinese

English News Magazine Parallel Text</a>

<br>

<br>

LDC2005S14

<br>

<a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S14">Levantine

Arabic QT Training Data Set 4 (Speech + Transcripts)</a>

<br>

<font face="Times New Roman, Times, serif"><br>

The Linguistic Data Consortium (LDC) is pleased to announce the

availability of three new corpora.<br>

<br>

</font>

<hr size="2" width="100%"><font face="Times New Roman, Times, serif"><br>

</font></div>

<br>

<a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T20">Arabic

Treebank: Part 3 (full corpus) v2.0 (MPG + Syntactic Analysis)</a>

supports the development of data-driven approaches to natural language

processing (NLP), human language technologies, automatic content

extraction (topic extraction and/or grammar extraction), cross-lingual

information retrieval, information detection, and other forms of

linguistic research on Modern Standard Arabic in general. The LDC was

sponsored to develop an Arabic part-of-speech (POS) and treebank corpus of

1,000,000 words, and this corpus is part three of that project. This release

provides both syntactic (treebank) annotation and POS, gloss, and word

segmentation annotation.

<br>

<br>

The current Arabic Treebank: Part 3 corpus consists of 600 stories from

the An Nahar News Agency. The new features include complete

vocalization of all Imperfect Verb mood endings: Indicative,

Subjunctive, and Jussive.

<br>

<font face="Times New Roman"><small><big><br>

<br>

</big></small></font>

<div align="center"><font face="Times New Roman"><small><big>*</big></small></font><br>

</div>

<br>

<br>

<a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T10">Chinese

English News Magazine Parallel Text</a> contains Chinese news stories

and their English translations drawn from Sinorama Magazine, Taiwan,

from 1976 to 2004. The corpus totals 6,366 story pairs, 365,568

sentence pairs, 20M Chinese characters and 9M English words. The data

obtained from Sinorama Magazine was aligned at the story level; the LDC

then aligned it at the sentence level using champollion v1.1.

The Sinorama Chinese text is encoded in Big5. <br>

<font face="Times New Roman"><small><big><br>

<br>

</big></small></font>

<div align="center"><font face="Times New Roman"><small><big>*</big></small></font><br>

</div>

<br>

<a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S14">Levantine

Arabic QT Training Data Set 4 (Speech + Transcripts)</a> contains 901

calls, totaling 133.6 hours of conversational telephone speech in

Levantine Arabic. The majority of speakers in this corpus are Lebanese.

The corpus also includes 901 transcript files in UTF-8 format. Speaker

information files are provided. <br>

<br>

<font face="Times New Roman"><small><big><br>

<br>

</big></small></font>


<hr size="2" width="100%"><font face="Times New Roman"><small><big><br>

</big></small></font>

<div align="center"><font face="Courier New"><small><big><font

 face="Times New Roman"><br>

If you need further

information or would like to inquire about

membership in the LDC, please email <a class="moz-txt-link-abbreviated"

 href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a> or call +1 215

573 1275.</font></big><br>

<br>

<br>

</small></font></div>

<div align="center">--------------------------------------------------------------------<br>

</div>

<div align="center">

<pre class="moz-signature" cols="72">Linguistic Data Consortium                     Phone: (215) 573-1275

3600 Market Street                             Fax:   (215) 573-2175

Suite 810                                          <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104                      <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></pre>

</div>

</div>

</body>

</html>