Arabic-L:LING:LDC Arabic Treebank

Dilworth Parkinson dil at BYU.EDU
Fri May 7 11:49:34 UTC 2010


------------------------------------------------------------------------
Arabic-L: Fri 07 May 2010
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
            unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject:LDC Arabic Treebank

-------------------------Messages-----------------------------------
1)
Date: 07 May 2010
From:Linguistic Data Consortium <ldc at ldc.upenn.edu>
Subject:LDC Arabic Treebank

(1) Arabic Treebank: Part 3 v 3.2 consists of 599 distinct newswire stories from the Lebanese publication An Nahar with part-of-speech (POS), morphology, gloss and syntactic treebank annotation in accordance with the Penn Arabic Treebank (PATB) Guidelines developed in 2008 and 2009. This release represents a significant revision of LDC's previous ATB3 publications: Arabic Treebank: Part 3 v 1.0, LDC2004T11, and Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis), LDC2005T20.

ATB3 v 3.2 contains a total of 339,710 tokens before clitics are split, and 402,291 tokens after clitics are separated for the treebank annotation. This release includes all files that were previously made available to the DARPA GALE program community (Arabic Treebank Part 3 - Version 3.1, LDC2008E22). A number of inconsistencies in the 3.1 release data have been corrected here. These corrections include changes to certain POS tags, along with the tree changes that result from them. As a result, additional clitics have been separated, and some tokens that were previously split incorrectly have now been merged.

One file from ATB3 v 2.0, ANN20020715.0063, has been removed from this corpus because its text is an exact duplicate of another file in this release (ANN20020715.0018). This reduces the number of files from 600 in ATB3 v 2.0 to 599 in ATB3 v 3.2.

Arabic Treebank: Part 3 v 3.2 is distributed on one CD-ROM.

2010 Subscription Members will automatically receive two copies of this corpus. 2010 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$4500.


--------------------------------------------------------------------------
End of Arabic-L: 07 May 2010

