[Corpora-List] New from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Thu Apr 29 15:42:27 UTC 2010
New Publications:

(1) Arabic Treebank: Part 3 v 3.2 - LDC2010T08
(2) Chinese Web 5-gram Version 1 - LDC2010T06
------------------------------------------------------------------------
New Publications
(1) Arabic Treebank: Part 3 v 3.2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T08>
consists of 599 distinct newswire stories from the Lebanese publication
An Nahar with part-of-speech (POS), morphology, gloss and syntactic
treebank annotation in accordance with the Penn Arabic Treebank (PATB)
Guidelines <http://projects.ldc.upenn.edu/ArabicTreebank/> developed in
2008 and 2009. This release represents a significant revision of LDC's
previous ATB3 publications: Arabic Treebank: Part 3 v 1.0 LDC2004T11
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T11>
and Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic
Analysis) LDC2005T20
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T20>.
ATB3 v 3.2 contains a total of 339,710 tokens before clitics are split,
and 402,291 tokens after clitics are separated for the treebank
annotation. This release includes all files that were previously made
available to the DARPA GALE program
<http://projects.ldc.upenn.edu/gale/index.html> community (Arabic
Treebank Part 3 - Version 3.1, LDC2008E22). A number of inconsistencies
in the 3.1 release data have been corrected here. These include changes
to certain POS tags and the corresponding tree structures. As a result,
additional clitics have been separated, and some previously incorrectly
split tokens have now been merged.
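To illustrate the effect of clitic separation (a generic example of
standard PATB practice, not a token drawn from this corpus): an
orthographic word such as wakitAbuhu ('and his book') is split into
three treebank tokens, the conjunction clitic wa-, the noun kitAbu, and
the pronominal clitic -hu, which is why the post-separation token count
is higher.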
One file from ATB3 v 2.0, ANN20020715.0063, has been removed from this
corpus as that text is an exact duplicate of another file in this
release (ANN20020715.0018). This reduces the number of files from 600
in ATB3 v 2.0 to 599 in ATB3 v 3.2.
(2) Chinese Web 5-gram Version 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T06>
contains Chinese word n-grams and their observed frequency counts. The
length of the n-grams ranges from unigrams (single words) to 5-grams.
This data should be useful for statistical language modeling, e.g., for
segmentation or machine translation, among other uses. Included with
this publication is a simple segmenter, written in Perl, that
implements the same algorithm used to generate the data.
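As a concrete illustration, the Python sketch below reads one n-gram
count file. It assumes the Web-1T-style layout used by earlier Google
n-gram releases (one n-gram per line: space-separated tokens, a tab,
then the frequency count); the file name is hypothetical, and the
corpus documentation is authoritative on the actual layout.

    def load_ngram_counts(path, encoding="utf-8"):
        """Yield (tokens, count) pairs from one n-gram count file."""
        with open(path, encoding=encoding) as f:
            for line in f:
                # Split on the last tab: left side is the n-gram,
                # right side is its observed frequency count.
                ngram, _, count = line.rstrip("\n").rpartition("\t")
                yield tuple(ngram.split(" ")), int(count)

    # Example (hypothetical file name): total 5-gram count mass.
    # total = sum(c for _, c in load_ngram_counts("5gm-0000"))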
N-gram counts were generated from approximately 883 billion word tokens
of text from publicly accessible web pages. While the aim was to
identify and collect only Chinese language pages, some text from other
languages is incidentally included in the final data. Data collection
took place in March 2008, so no text created on or after April 1, 2008
was used.
The input character encoding of documents was automatically detected,
and all text was converted to UTF-8. The data are tokenized by an
automatic tool, and all continuous Chinese character sequences are
passed to the segmenter.
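A minimal Python sketch of that pipeline follows. It assumes the
third-party chardet package for encoding detection and a placeholder
segment() function standing in for the Perl segmenter shipped with the
corpus; neither reflects the tools actually used to build the data.

    import re
    import chardet  # third-party; stand-in for the actual detector

    # Continuous runs of CJK Unified Ideographs (basic block).
    CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")

    def segment(run):
        # Placeholder: the corpus ships a Perl segmenter; any
        # dictionary-based or statistical segmenter could slot in.
        return [run]

    def preprocess(raw_bytes):
        """Detect encoding, convert to text, segment Chinese runs."""
        enc = chardet.detect(raw_bytes)["encoding"] or "utf-8"
        text = raw_bytes.decode(enc, errors="replace")
        tokens, pos = [], 0
        for m in CJK_RUN.finditer(text):
            tokens.extend(text[pos:m.start()].split())  # non-Chinese
            tokens.extend(segment(m.group()))           # Chinese runs
            pos = m.end()
        tokens.extend(text[pos:].split())
        return tokens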
The following types of tokens are considered valid (a rough sketch of
these rules in code follows the list):
* A Chinese word containing only Chinese characters.
* Numbers, e.g., 198, 2,200, 2.3, etc.
* Single Latin tokens, e.g., Google, &, ab, etc.
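Here is that sketch, with hypothetical regular expressions that capture
the intent of the three rules; the corpus documentation is
authoritative on edge cases:

    import re

    VALID_TOKEN = re.compile(
        r"^(?:"
        r"[\u4e00-\u9fff]+"            # Chinese characters only
        r"|\d+(?:,\d{3})*(?:\.\d+)?"   # numbers: 198, 2,200, 2.3
        r"|[A-Za-z]+|&"                # Latin tokens: Google, ab, &
        r")$"
    )

    def is_valid_token(tok):
        return bool(VALID_TOKEN.match(tok))

    # is_valid_token("2,200") -> True; is_valid_token("--") -> False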
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu