[Corpora-List] New Version 0.4 of the Quranic Arabic Corpus

Kais Dukes sckd at leeds.ac.uk
Mon May 2 09:23:33 UTC 2011


Apologies for cross-posting.

The Quranic Arabic Corpus (http://corpus.quran.com) is an international collaborative linguistic project, initiated at the University of Leeds, that aims to bridge the gap between the traditional Arabic grammar of i'rab and techniques from modern computational linguistics. This open source resource includes part-of-speech tagging for the Quran, morphological segmentation, and a formal representation of Quranic syntax using dependency graphs. Version 0.4 of the corpus provides several improvements over the previous release:

*** [Increased coverage for the syntactic treebank]. Version 0.4 of the treebank covers 40% of the Quran by word count (30,895 out of 77,429 words). The treebank provides syntactic annotation using dependency grammar for chapters 1-8 and 59-114 of the Quran.

*** [Revised morphological analysis]. Following online collaboration by volunteer annotators, over 500 suggestions have been cross-checked against traditional sources of Arabic grammar, resulting in more accurate morphological tagging.

*** [Improved Quran dictionary and lemmatization]. The list of roots and lemmas that group related derived words has been made more consistent with traditional Arabic lexicons. The online Quran dictionary now also includes concordance lines from Quranic verses as context.

*** [Readability and navigation improvements]. The content of the website has been better organized, with improvements to navigation and layout. Several typing mistakes and omissions have been corrected in the word-by-word interlinear translation into English.

*** [More accurate tagging of proper nouns]. Eight named entities that were previously tagged only as nouns have been added to the semantic ontology: Al-Ahqaf, Al-Jahiliyah, Al-Jumu'ah, Baal, Magians, Salsabil, Sirius, and Zaqqum.

*** [More accurate tagging for particles waw and fa]. In accordance with traditional Arabic grammar, the particle fa is now tagged as a supplemental particle (harf za'id) in certain words, such as the combination a-fa-man.

*** [Version 0.4 of the morphologically annotated corpus] is freely available for download from the Quranic Arabic Corpus website.
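
As a quick sanity check on the treebank coverage figures quoted above, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the word counts and chapter ranges are taken from this announcement, the total of 114 chapters is the standard chapter count of the Quran, and the function and variable names are hypothetical, not part of the corpus distribution.

```python
# Figures quoted in the announcement (treebank, version 0.4).
TREEBANK_WORDS = 30_895
TOTAL_WORDS = 77_429
TOTAL_CHAPTERS = 114                  # the Quran has 114 chapters
COVERED_RANGES = [(1, 8), (59, 114)]  # inclusive chapter ranges


def word_coverage(covered: int, total: int) -> float:
    """Fraction of words covered by the treebank."""
    return covered / total


def chapters_covered(ranges: list[tuple[int, int]]) -> int:
    """Number of chapters in a list of inclusive (start, end) ranges."""
    return sum(end - start + 1 for start, end in ranges)


coverage = word_coverage(TREEBANK_WORDS, TOTAL_WORDS)
n_chapters = chapters_covered(COVERED_RANGES)
print(f"{coverage:.1%} of words, {n_chapters} of {TOTAL_CHAPTERS} chapters")
```

By word count the coverage rounds to the 40% stated above, while by chapter count it is 64 of 114, since the chapters not yet covered (9-58) are among the longer ones.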

The Quranic Arabic Corpus is an open source project. Contributions or questions about the research are more than welcome. Please direct any correspondence to Kais Dukes, PhD researcher at the School of Computing, University of Leeds:

web: www.kaisdukes.com
e-mail: sckd at leeds.ac.uk

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


