Arabic-L:LING:Arabic Treebank Part 3 announced
Dilworth Parkinson
dilworth_parkinson at byu.edu
Fri May 28 22:25:08 UTC 2004
-------------------------------------------------------------------------
Arabic-L: Fri 28 May 2004
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send a message from the address you subscribed with to
listserv at byu.edu with the first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject: Arabic Treebank Part 3 announced
-------------------------Messages-----------------------------------
1)
Date: 28 May 2004
From: ldc at ldc.upenn.edu (reposted from CORPORA)
Subject: Arabic Treebank Part 3 announced
LDC2004T11
* Arabic Treebank: Part 3 v.1.0 *
The Linguistic Data Consortium (LDC) is pleased to announce the
availability of
....
....
....
(2) Arabic Treebank: Part 3 v 1.0 is the third part of a
1,000,000-word Arabic Treebank corpus, designed to support language
research and the development of language technology for Modern Standard
Arabic. This part includes 600 stories from the An Nahar News
Agency. The 600 files, one story per file, contain a total of 340,281
words (counting non-Arabic tokens such as numbers and punctuation).
New features of annotation include complete vocalization (including
case endings), lemma IDs, and more specific POS tags for verbs and
particles.
The corpus contains 293,035 Arabic-only word tokens (prior to the
separation of clitics), of which 290,842 (99.25%) were provided with an
acceptable morphological analysis and POS tag by the morphological
parser, and 2,193 (0.75%) were items that the morphological parser
failed to analyze correctly. Arabic Treebank: Part 3 v 1.0 is
distributed on 1 CD.
For further information, including online documentation, please visit:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T11
Institutions that have membership in the LDC for the 2004 Membership
Year will be able to receive this corpus free of charge. Nonmembers
may license this data for US$3000.
--------------------------------------------------------------------------
End of Arabic-L: 28 May 2004