Arabic-L:LING:New Version 0.4 of the Quranic Arabic Corpus
Dilworth Parkinson
dil at BYU.EDU
Mon May 2 23:19:49 UTC 2011
------------------------------------------------------------------------
Arabic-L: Mon 02 May 2011
Moderator: Dilworth Parkinson <dil at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject: New Version 0.4 of the Quranic Arabic Corpus
-------------------------Messages-----------------------------------
1)
Date: 02 May 2011
From: Kais Dukes <sckd at leeds.ac.uk>
Subject: New Version 0.4 of the Quranic Arabic Corpus
The Quranic Arabic Corpus (http://corpus.quran.com) is an international collaborative linguistic project, initiated at the University of Leeds, that aims to bridge the gap between the traditional Arabic grammar of i'rab and techniques from modern computational linguistics. This open source resource includes part-of-speech tagging for the Quran, morphological segmentation, and a formal representation of Quranic syntax using dependency graphs. Version 0.4 of the corpus provides several improvements over the previous release:
*** [Increased coverage for the syntactic treebank]. Version 0.4 of the treebank covers 40% of the Quran by word count (30,895 out of 77,429 words). The treebank provides syntactic annotation using dependency grammar for chapters 1-8 and 59-114 of the Quran.
*** [Revised morphological analysis]. Following online collaboration by volunteer annotators, over 500 suggestions have been cross-checked against traditional sources of Arabic grammar, resulting in more accurate morphological tagging.
*** [Improved Quran dictionary and lemmatization]. The list of roots and lemmas that group related derived words has been made more consistent with traditional Arabic lexicons. The online Quran dictionary now also includes concordance lines from Quranic verses as context.
*** [Readability and navigation improvements]. The content of the website has been better organized, with improvements to navigation and layout. Several typing mistakes and omissions have been corrected in the word-by-word interlinear translation into English.
*** [More accurate tagging of proper nouns]. Eight named entities that were previously tagged only as nouns have been added to the semantic ontology: Al-Ahqaf, Al-Jahiliyah, Al-Jumu'ah, Baal, Magians, Salsabil, Sirius, and Zaqqum.
*** [More accurate tagging for particles waw and fa]. In accordance with traditional Arabic grammar, for certain words, the particle fa is now tagged as a supplemental particle (harf za'id), such as in the combination a-fa-man.
*** [Version 0.4 of the morphologically annotated corpus] is freely available for download from the Quranic Arabic Corpus website.
The Quranic Arabic Corpus is an open source project. Contributions or questions about the research are more than welcome. Please direct any correspondence to Kais Dukes, PhD researcher at the School of Computing, University of Leeds:
web: www.kaisdukes.com
e-mail: sckd at leeds.ac.uk
--------------------------------------------------------------------------
End of Arabic-L: 02 May 2011