<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

</head>

<body bgcolor="#ffffff" text="#000000">

<div align="center"><i>New

Publications:</i></div>

<p class="MsoNormal" style="text-align: center;" align="center">LDC2010T08<b><br>

- <a href="#atb">Arabic

Treebank: Part 3 v 3.2</a></b><b> -</b><br>

<br>

LDC2010T06<br>

<b>- <a href="#web">Chinese

Web 5-gram Version 1</a></b><b> -</b></p>


<div class="MsoNormal" style="text-align: center;" align="center">

<hr align="center" size="2" width="100%"></div>

<div align="center"><b>New

Publications<br>

<br>

</b></div>

<p>(1)  <a name="atb"></a><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T08">Arabic

Treebank: Part 3 v 3.2</a> consists of 599 distinct newswire stories

from the Lebanese publication An Nahar with part-of-speech (POS),

morphology,

gloss and syntactic treebank annotation in accordance with the <a

 href="http://projects.ldc.upenn.edu/ArabicTreebank/">Penn Arabic

Treebank

(PATB) Guidelines</a> developed in 2008 and 2009. This release

represents a

significant revision of LDC's previous ATB3 publications: <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T11">Arabic

Treebank: Part 3 v 1.0 LDC2004T11</a> and <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T20">Arabic

Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis)

LDC2005T20</a>.</p>

<p>ATB3 v 3.2 contains a total of 339,710 tokens before clitics are

split, and

402,291 tokens after clitics are separated for the treebank annotation.

This

release includes all files that were previously made available to the <a

 href="http://projects.ldc.upenn.edu/gale/index.html">DARPA GALE program</a>

community (Arabic Treebank Part 3 - Version 3.1, LDC2008E22). A number

of

inconsistencies in the 3.1 release data have been corrected here. These

include

changes to certain POS tags, along with the resulting tree changes. As a

result,

additional clitics have been separated, and some tokens that were

previously split incorrectly

have now been merged.</p>

<p>One file from ATB3 v 2.0, ANN20020715.0063, has been removed from

this

corpus as that text is an exact duplicate of another file in this

release

(ANN20020715.0018). This reduces the number of files from 600 in

ATB3 v

2.0 to 599 in ATB3 v 3.2.</p>

<p class="MsoNormal" style="margin-bottom: 12pt;"><br>

</p>

<p class="MsoNormal" style="margin-bottom: 12pt;">[<a href="#top">

top </a>]</p>

<p class="MsoNormal" style="margin-bottom: 12pt;"><br>

</p>

<p class="MsoNormal" style="text-align: center;" align="center">*</p>

<p class="MsoNormal"><br>

(2) <a name="web"></a><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T06">Chinese

Web 5-gram Version 1</a> contains Chinese word n-grams and

their

observed frequency counts. The length of the n-grams ranges from

unigrams

(single words) to 5-grams. This data should be useful for statistical

language

modeling (e.g., for segmentation, machine translation), as well as for

other

uses. This publication includes a simple segmenter, written in

Perl, that uses the same algorithm used to generate the data.</p>

<p>N-gram counts were generated from approximately 883 billion word

tokens of

text from publicly accessible web pages. While the aim was to identify

and

collect only Chinese language pages, some text from other languages is

incidentally included in the final data. Data collection took place in

March 2008, so no text created on or after April 1, 2008

was used.</p>

<p>The input character encoding of documents was automatically

detected, and

all text was converted to UTF-8. The data were tokenized by an automatic

tool,

and every contiguous sequence of Chinese characters was passed to the

segmenter.</p>

<p>The following types of tokens are considered valid:</p>

<ul type="disc">

  <li class="MsoNormal" style="">A Chinese word containing only Chinese

characters.</li>

  <li class="MsoNormal" style="">Numbers, e.g., 198, 2,200, 2.3.</li>

  <li class="MsoNormal" style="">Single Latin tokens, such as Google,

&amp; ab.</li>

</ul>

<br>

<br>

[<a href="#top">

top </a>]<br>

<br>

<hr size="2" width="100%">


<div align="center">

<pre class="moz-signature" cols="72"><big><font

 face="Courier New, Courier, monospace"><small><small><big>Ilya Ahtaridis</big></small></small></font>

<font face="Courier New, Courier, monospace"><small><small><big>Membership Coordinator</big></small></small></font></big>

<font face="Courier New, Courier, monospace"><small>--------------------------------------------------------------------</small></font>

<font face="Courier New, Courier, monospace">Linguistic Data Consortium                     Phone: (215) 573-1275

University of Pennsylvania                       Fax: (215) 573-2175

3600 Market St., Suite 810                         <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104 USA                  <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>

</div>


</body>

</html>