Arabic-L:LING:Quranic Arabic Corpus-Version 0.1 Released

Thu Nov 12 20:39:54 UTC 2009

------------------------------------------------------------------------
Arabic-L: Thu 12 Nov 2009
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
             unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject:Quranic Arabic Corpus-Version 0.1 Released

-------------------------Messages-----------------------------------
1)
Date: 12 Nov 2009
From:Kais Dukes <dukes.kais at googlemail.com>

Subject:Quranic Arabic Corpus-Version 0.1 Released

Hello All,

For those interested in Arabic part-of-speech tagging and syntactic
analysis, a new resource has now been made available as a free open
source download:

http://quran.uk.net

You can now obtain version 0.1 of the data which includes:

(1) A plain text file showing each word in every verse of the Quran,  
together with its (contextual) part-of-speech tag.
(2) The same data in XML format, encoded as UTF-8.
(3) A more detailed XML file with full morphological (inflection and
derivation) feature tags.
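
As a rough illustration of how the plain text file in (1) might be
consumed, here is a minimal Python sketch. The layout it assumes (one
word per line, with tab-separated location, word form and tag, where the
location is chapter:verse:word) is hypothetical and is not taken from
the release itself; the sample values are placeholders, not corpus data.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class TaggedWord(NamedTuple):
    chapter: int
    verse: int
    position: int
    form: str
    tag: str


def parse_tagged_lines(lines: Iterable[str]) -> Iterator[TaggedWord]:
    """Yield one TaggedWord per data line of the ASSUMED layout."""
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and comment lines (the '#' convention is assumed).
        if not line or line.startswith("#"):
            continue
        location, form, tag = line.split("\t")
        chapter, verse, position = (int(part) for part in location.split(":"))
        yield TaggedWord(chapter, verse, position, form, tag)


# Placeholder input: invented word forms and tags, for illustration only.
SAMPLE = """\
# location<TAB>form<TAB>tag (hypothetical layout)
1:1:1\tw1\tP
1:1:2\tw2\tN
1:2:1\tw3\tV
"""

words = list(parse_tagged_lines(SAMPLE.splitlines()))
tag_counts = Counter(word.tag for word in words)
```

For the real download, check the file's actual column order and
delimiter first and adjust the split calls to match.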

We plan to produce incremental updates until we reach version 1.0:
cross-annotator verification of the full morphology and syntax of the
Quran using dependency grammar.

The Quranic Arabic Corpus is an annotated linguistic resource
consisting of 77,430 words of Quranic Arabic. The research project is
led by Kais Dukes at the University of Leeds, and is part of the Arabic
language computing research group within the School of Computing,
supervised by Eric Atwell. The project aims to provide a richly
annotated linguistic resource for researchers wanting to study the
Arabic language of the Quran. The grammatical analysis helps readers
uncover the detailed intended meaning of each verse and sentence. Each
word of the Quran is tagged with its part-of-speech as well as multiple
morphological features. Unlike other annotated Arabic corpora, the
grammar framework adopted by the Quranic Corpus is the traditional
Arabic grammar of i'rab.

The research project includes:

- A manually verified part-of-speech tagged Quranic Arabic corpus.
- An annotated treebank of Quranic Arabic.
- A novel visualization of traditional Arabic grammar through  
dependency graphs.
- Morphological search for the Quran.
- A machine-readable morphological lexicon mapping Quranic words into
English.
- A part-of-speech concordance for Quranic Arabic organized by lemma.
- An online message board for community volunteer annotation.

The annotation for each of the 77,430 words in the Quran has been
reviewed in stages by two annotators, and work is ongoing to further
improve accuracy.

Any feedback on the project is most welcome.

Kind Regards,

-- Kais Dukes
School of Computing
University of Leeds

web: http://quran.uk.net
e-mail: sckd at leeds.ac.uk

--------------------------------------------------------------------------
End of Arabic-L:  12 Nov 2009
