[Corpora-List] New from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Thu Apr 29 15:42:27 UTC 2010
New Publications:

(1) Arabic Treebank: Part 3 v 3.2 - LDC2010T08
(2) Chinese Web 5-gram Version 1 - LDC2010T06
------------------------------------------------------------------------
New Publications
(1) Arabic Treebank: Part 3 v 3.2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T08>
consists of 599 distinct newswire stories from the Lebanese publication
An Nahar with part-of-speech (POS), morphology, gloss and syntactic
treebank annotation in accordance with the Penn Arabic Treebank (PATB)
Guidelines <http://projects.ldc.upenn.edu/ArabicTreebank/> developed in
2008 and 2009. This release represents a significant revision of LDC's
previous ATB3 publications: Arabic Treebank: Part 3 v 1.0 LDC2004T11
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T11>
and Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic
Analysis) LDC2005T20
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T20>.
ATB3 v 3.2 contains a total of 339,710 tokens before clitics are split,
and 402,291 tokens after clitics are separated for the treebank
annotation. This release includes all files that were previously made
available to the DARPA GALE program
<http://projects.ldc.upenn.edu/gale/index.html> community (Arabic
Treebank Part 3 - Version 3.1, LDC2008E22). A number of inconsistencies
in the 3.1 release data have been corrected here. These include changes
to certain POS tags and the corresponding tree structures. As a result,
additional clitics have been separated, and some previously incorrectly
split tokens have now been merged.
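To illustrate the effect of clitic separation (a generic example of
standard PATB practice, not a token drawn from this corpus): an
orthographic word such as wakitAbuhu ('and his book') is split into
three treebank tokens, the conjunction clitic wa-, the noun kitAbu, and
the pronominal clitic -hu, which is why the post-separation token count
is higher.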
One file from ATB3 v 2.0, ANN20020715.0063, has been removed from this
corpus as that text is an exact duplicate of another file in this
release (ANN20020715.0018). This reduces the number of files from 600
in ATB3 v 2.0 to 599 in ATB3 v 3.2.
(2) Chinese Web 5-gram Version 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T06>
contains Chinese word n-grams and their observed frequency counts. The
length of the n-grams ranges from unigrams (single words) to 5-grams.
This data should be useful for statistical language modeling, e.g., for
segmentation or machine translation, among other uses. Included with
this publication is a simple segmenter, written in Perl, that
implements the same algorithm used to generate the data.
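As a concrete illustration, the Python sketch below reads one n-gram
count file. It assumes the Web-1T-style layout used by earlier Google
n-gram releases (one n-gram per line: space-separated tokens, a tab,
then the frequency count); the file name is hypothetical, and the
corpus documentation is authoritative on the actual layout.

    def load_ngram_counts(path, encoding="utf-8"):
        """Yield (tokens, count) pairs from one n-gram count file."""
        with open(path, encoding=encoding) as f:
            for line in f:
                # Split on the last tab: left side is the n-gram,
                # right side is its observed frequency count.
                ngram, _, count = line.rstrip("\n").rpartition("\t")
                yield tuple(ngram.split(" ")), int(count)

    # Example (hypothetical file name): total 5-gram count mass.
    # total = sum(c for _, c in load_ngram_counts("5gm-0000"))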
N-gram counts were generated from approximately 883 billion word tokens
of text from publicly accessible web pages. While the aim was to
identify and collect only Chinese language pages, some text from other
languages is incidentally included in the final data. Data collection
took place in March 2008, so no text created on or after April 1, 2008
was used.
The input character encoding of documents was automatically detected,
and all text was converted to UTF-8. The data are tokenized by an
automatic tool, and all continuous Chinese character sequences are
passed to the segmenter.
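A minimal Python sketch of that pipeline follows. It assumes the
third-party chardet package for encoding detection and a placeholder
segment() function standing in for the Perl segmenter shipped with the
corpus; neither reflects the tools actually used to build the data.

    import re
    import chardet  # third-party; stand-in for the actual detector

    # Continuous runs of CJK Unified Ideographs (basic block).
    CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")

    def segment(run):
        # Placeholder: the corpus ships a Perl segmenter; any
        # dictionary-based or statistical segmenter could slot in.
        return [run]

    def preprocess(raw_bytes):
        """Detect encoding, convert to text, segment Chinese runs."""
        enc = chardet.detect(raw_bytes)["encoding"] or "utf-8"
        text = raw_bytes.decode(enc, errors="replace")
        tokens, pos = [], 0
        for m in CJK_RUN.finditer(text):
            tokens.extend(text[pos:m.start()].split())  # non-Chinese
            tokens.extend(segment(m.group()))           # Chinese runs
            pos = m.end()
        tokens.extend(text[pos:].split())
        return tokens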
The following types of tokens are considered valid (a rough sketch of
these rules in code follows the list):
* A Chinese word containing only Chinese characters.
* Numbers, e.g., 198, 2,200, 2.3, etc.
* Single Latin tokens, e.g., Google, &, ab, etc.
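Here is that sketch, with hypothetical regular expressions that capture
the intent of the three rules; the corpus documentation is
authoritative on edge cases:

    import re

    VALID_TOKEN = re.compile(
        r"^(?:"
        r"[\u4e00-\u9fff]+"            # Chinese characters only
        r"|\d+(?:,\d{3})*(?:\.\d+)?"   # numbers: 198, 2,200, 2.3
        r"|[A-Za-z]+|&"                # Latin tokens: Google, ab, &
        r")$"
    )

    def is_valid_token(tok):
        return bool(VALID_TOKEN.match(tok))

    # is_valid_token("2,200") -> True; is_valid_token("--") -> False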
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu