[Corpora-List] New Version 0.3 of the Quranic Arabic Corpus

Kais Dukes sckd at leeds.ac.uk
Sat Mar 12 17:23:42 UTC 2011


Apologies for cross-posting.

The Quranic Arabic Corpus (http://corpus.quran.com) is an international collaborative linguistic project initiated at the University of Leeds that aims to bridge the gap between the traditional Arabic grammar of i'rab and techniques from modern computational linguistics. This open source resource includes word-by-word part-of-speech tagging for the Quran, morphological segmentation and a formal representation of Quranic Arabic syntax using dependency graphs. Version 0.3 of the corpus includes a number of significant improvements over the previous 0.2 release:

*** [Increased coverage for the syntactic treebank]. The treebank now covers 30% of the Quran by word count (hence the version 0.3 release number). The syntactic treebank provides annotation using dependency grammar for chapters 1-5 and 59-114, covering 23,292 out of 77,430 words in the Quran. The treebank also includes a revised set of non-terminal phrase tags for nominal sentences (jumlah ismiyah), verbal sentences (jumlah fi'liyah), and conditional sentences (jumlah shartiyah).
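As a quick check on the coverage figure, the short Python sketch below recomputes it from the word counts and chapter ranges quoted above. The variable names are illustrative only and are not part of the corpus or its tools.

```python
# Figures quoted in the announcement above.
TREEBANK_WORDS = 23_292
TOTAL_WORDS = 77_430
TOTAL_CHAPTERS = 114

# Chapters with dependency annotation: 1-5 and 59-114 (inclusive).
covered_chapters = list(range(1, 6)) + list(range(59, 115))

# Fraction of the Quran covered by the treebank, by word count.
coverage = TREEBANK_WORDS / TOTAL_WORDS

print(f"chapters covered: {len(covered_chapters)} of {TOTAL_CHAPTERS}")
print(f"word coverage: {coverage:.1%}")
```

This gives 61 of 114 chapters and a word coverage of about 30.1%, which matches the 0.3 in the release number.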

*** [Improved accuracy for tagging and morphological analysis] covering 100% of the Quranic text. Following online collaboration by volunteer annotators, over 2,000 suggestions for improved part-of-speech and morphological tagging have been reviewed in detail and cross-checked against traditional sources of Arabic grammar, resulting in further improvements to the accuracy of the annotated resource.

*** [More consistent morphological segmentation]. Each of the 77,430 words in the Quran has been morphologically segmented, resulting in 128,076 individual morphemes. In accordance with traditional Arabic grammar, each morpheme has been separately tagged for part-of-speech and multiple morphological features including noun case and verb mood, gender, number and person. The improved segmentation used in version 0.3 of the corpus is more consistent with i'rab. For example, the suffixed nun of emphasis (nun l-tawkid) is now correctly analysed as a separate morphological segment.
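To make the idea of segment-level annotation concrete, here is a minimal Python sketch of how one segmented word might be held in memory. This is an assumption for illustration, not the corpus's actual file format or API: the class names, the `EMPHATIC_NUN` label, the feature keys and the placeholder forms are all hypothetical. Only the two totals are taken from the text above.

```python
from dataclasses import dataclass, field

# Totals quoted in the announcement above.
TOTAL_WORDS = 77_430
TOTAL_MORPHEMES = 128_076


@dataclass
class Segment:
    """One morphological segment with its own tag and features (hypothetical layout)."""
    form: str
    pos: str
    features: dict = field(default_factory=dict)


@dataclass
class Word:
    """A word as an ordered list of separately tagged segments."""
    segments: list


# Hypothetical example: a verb stem followed by the suffixed nun of emphasis,
# analysed as two separate segments. Forms and feature values are placeholders.
word = Word(segments=[
    Segment(form="<verb stem>", pos="V",
            features={"person": 3, "gender": "M", "number": "P"}),
    Segment(form="<nun>", pos="EMPHATIC_NUN"),
])

for seg in word.segments:
    print(seg.pos, seg.form, seg.features)

# Average number of morphemes per word across the whole text.
print(f"morphemes per word: {TOTAL_MORPHEMES / TOTAL_WORDS:.2f}")
```

On the quoted totals, the average works out to roughly 1.65 morphemes per word.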

*** [High-resolution vector graphics for the Quranic script] are now used to display Arabic words in dependency graphs, replacing the previous use of glyph-based fonts. The script is now based on electronic scans developed by the Quran Printing Complex. This has resulted in improved typographic accuracy for the Arabic words displayed in the syntactic treebank, most notably for ligatures, verse pause marks, and diacritic alignment. Previously a TrueType font was used to render Arabic words in dependency graphs, which did not always accurately represent the intricacies of the Quranic Uthmani script.

*** [An extended tagset with finer grained part-of-speech tags] including INT - particle of interpretation (harf tafsir), CIRC - for the circumstantial usage of the particle waw (waw l-haliyah), COM - for the comitative usage of the particle waw (waw l-ma'iyah) and RSLT - for the result usage of the particle fa. In addition, for better consistency with traditional Arabic grammar, numerical words are no longer tagged NUM; they are now tagged ADJ (adjective) or N (noun), depending on syntactic function and context.
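For anyone updating scripts written against the 0.2 tags, the Python sketch below lists the four new tags named above and shows one possible shape for a NUM retagging step. It is only an illustration: the `acts_as_modifier` flag is a hypothetical stand-in for the annotator's judgement of syntactic function and context, and is not a rule taken from the annotation guidelines.

```python
# New part-of-speech tags named in the announcement, with their descriptions.
NEW_TAGS = {
    "INT": "particle of interpretation (harf tafsir)",
    "CIRC": "circumstantial usage of the particle waw (waw l-haliyah)",
    "COM": "comitative usage of the particle waw (waw l-ma'iyah)",
    "RSLT": "result usage of the particle fa",
}


def retag_numeral(old_tag: str, acts_as_modifier: bool) -> str:
    """Map the retired NUM tag to ADJ or N; leave every other tag unchanged.

    `acts_as_modifier` is a hypothetical flag supplied by a human annotator.
    The real decision depends on syntactic function and context.
    """
    if old_tag != "NUM":
        return old_tag
    return "ADJ" if acts_as_modifier else "N"


for tag, description in NEW_TAGS.items():
    print(f"{tag}: {description}")

print(retag_numeral("NUM", acts_as_modifier=True))
print(retag_numeral("NUM", acts_as_modifier=False))
```

In practice the choice between ADJ and N would need manual review rather than a single boolean, so a step like this is best treated as a way of recording decisions, not making them.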

*** [Better natural language generation] for automatic summaries of linguistic annotation. For example, when a first person object pronoun suffix is represented only by a terminal kasrah diacritic (instead of the more usual ya suffix), this is now correctly mentioned in the word-by-word annotation displayed online.

*** [Links to updated academic publications] on the Quranic Arabic Corpus: two LREC papers, an INFOS 2010 paper, a FAL book chapter, and an LRE Journal paper, together with a link to an online review of the Quranic Arabic Corpus at Examiner.com. The full versions of these papers are now available as PDF downloads from the Quranic Arabic Corpus website. These publications and articles explain in detail the original research contributions of the Quranic Arabic Corpus project.

*** [Improved online documentation] for the corpus, and additional sections in the online annotation guidelines, most notably a new detailed section on the different types of verb forms in Quranic Arabic morphology.

*** [Enhanced morphological search] for the Quran, including the ability to search on additional part-of-speech tags and linguistic features.

*** [Version 0.3 of the reviewed morphologically annotated data] is freely available for download from the Quranic Arabic Corpus website.

The Quranic Arabic Corpus is an open source project. Contributions or questions about the research are more than welcome. Please direct any correspondence to Kais Dukes, PhD researcher at the School of Computing, University of Leeds:

web: www.kaisdukes.com
e-mail: sckd at leeds.ac.uk

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


