Arabic-L:LING:Arabic Treebank Part 3 announced

Dilworth Parkinson dilworth_parkinson at byu.edu
Fri May 28 22:25:08 UTC 2004


-------------------------------------------------------------------------
Arabic-L: Fri 28 May 2004
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
            unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject: Arabic Treebank Part 3 announced

-------------------------Messages-----------------------------------
1)
Date: 28 May 2004
From: ldc at ldc.upenn.edu (reposted from CORPORA)
Subject: Arabic Treebank Part 3 announced

LDC2004T11
*  Arabic Treebank: Part 3 v.1.0  *

The Linguistic Data Consortium (LDC) is pleased to announce the  
availability of
....
....
....
(2)  Arabic Treebank: Part 3 v 1.0 is the third part of a corpus of  
1,000,000 words of Arabic Treebank, designed to support language  
research and development of language technology for Modern Standard  
Arabic.  This corpus includes 600 stories from the An Nahar News  
Agency. There are a total of 340,281 words (counting non-Arabic tokens  
such as numbers and punctuation) in the 600 files - one story per file.  
New features of annotation include complete vocalization (including  
case endings), lemma IDs, and more specific POS tags for verbs and  
particles.

  The corpus contains 293,035 Arabic-only word tokens (prior to the  
separation of clitics), of which 290,842 (99.25%) were provided with an  
acceptable morphological analysis and POS tag by the morphological  
parser, and 2,193 (0.75%) were items that the morphological parser  
failed to analyze correctly.  Arabic Treebank: Part 3 v 1.0 is  
distributed on 1 CD.
  For further information, including online documentation, please visit:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T11

  Institutions that have membership in the LDC for the 2004 Membership  
Year will be able to receive this corpus free of charge.  Nonmembers  
may license this data for US$3000.


--------------------------------------------------------------------------
End of Arabic-L: 28 May 2004


