[Corpora-List] Quranic Arabic Corpus – Version 0.1 Released

Kais Dukes sckd at leeds.ac.uk
Thu Nov 12 15:22:58 UTC 2009


Quranic Arabic Corpus – Version 0.1 Released

Hello All,

For those interested in Arabic part-of-speech tagging and syntactic analysis, a new resource has now been made available as a free, open source download:

http://quran.uk.net

You can now obtain version 0.1 of the data, which includes:

(1) A plain text file showing each word in every verse of the Quran, together with its (contextual) part-of-speech tag.
(2) The same data in XML format, encoded as UTF-8.
(3) A more detailed XML file with full morphological (inflection + derivation) feature tags.
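
As a rough illustration of how the plain text file in (1) might be consumed, here is a minimal Python sketch. The exact file layout is not described in this message, so the sketch assumes a hypothetical tab-separated layout of the form "chapter:verse:position<TAB>word<TAB>tag", with "#" marking comment lines. The names TaggedWord, parse_tagged_lines and tag_frequencies, as well as the sample lines, are placeholders made up for the example and are not taken from the released data.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class TaggedWord(NamedTuple):
    """One word of the corpus with its location and part-of-speech tag."""
    chapter: int
    verse: int
    position: int
    form: str
    tag: str


def parse_tagged_lines(lines: Iterable[str]) -> List[TaggedWord]:
    """Parse lines in the ASSUMED layout 'chapter:verse:position<TAB>form<TAB>tag'.

    Blank lines and lines starting with '#' are skipped.
    """
    words = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        location, form, tag = line.split("\t")
        chapter, verse, position = (int(part) for part in location.split(":"))
        words.append(TaggedWord(chapter, verse, position, form, tag))
    return words


def tag_frequencies(words: Iterable[TaggedWord]) -> Counter:
    """Count how often each part-of-speech tag occurs."""
    return Counter(word.tag for word in words)


# Placeholder lines only; the forms and tags are NOT real corpus data.
SAMPLE = [
    "# hypothetical sample in the assumed layout",
    "1:1:1\tform_a\tN",
    "1:1:2\tform_b\tV",
    "",
    "1:2:1\tform_c\tN",
]

if __name__ == "__main__":
    parsed = parse_tagged_lines(SAMPLE)
    for tag, count in tag_frequencies(parsed).most_common():
        print(tag, count)
```

If the released file uses a different separator or column order, only the two split calls in parse_tagged_lines should need to change.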

We plan to release incremental updates until we reach version 1.0, which will provide cross-annotator verification of the full morphology and syntax of the Quran using dependency grammar.

The Quranic Arabic Corpus is an annotated linguistic resource consisting of 77,430 words of Quranic Arabic. The research project is led by Kais Dukes at the University of Leeds, and is part of the Arabic language computing research group within the School of Computing, supervised by Eric Atwell.

The project aims to provide a richly annotated linguistic resource for researchers wanting to study the Arabic language of the Quran. The grammatical analysis helps readers uncover the detailed intended meaning of each verse and sentence. Each word of the Quran is tagged with its part-of-speech as well as multiple morphological features. Unlike other annotated Arabic corpora, the grammar framework adopted by the Quranic Corpus is the traditional Arabic grammar of i'rab.

The research project includes:

- A manually verified part-of-speech tagged Quranic Arabic corpus.
- An annotated treebank of Quranic Arabic.
- A novel visualization of traditional Arabic grammar through dependency graphs.
- Morphological search for the Quran.
- A machine-readable morphological lexicon of Quranic words, with English translations.
- A part-of-speech concordance for Quranic Arabic organized by lemma.
- An online message board for community volunteer annotation.

The annotation for each of the 77,430 words in the Quran has been reviewed in stages by two annotators, and work is ongoing to further improve accuracy.

Any feedback on the project is most welcome.

Kind Regards,

-- Kais Dukes
School of Computing
University of Leeds

web: http://quran.uk.net
e-mail: sckd at leeds.ac.uk


_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


