[Corpora-List] Quranic Arabic Corpus – Version 0.1 Released
El-Haj, Mahmoud
melhaj at essex.ac.uk
Thu Nov 12 16:50:52 UTC 2009
Dear Kais,
First of all, congratulations. This is indeed a huge effort and a very helpful website.
I worked on Quranic Arabic NLP for a while, focusing mainly on information retrieval and Quranic thesaurus search, in addition to Tajweed rule extraction.
I went through some chapters in your corpus and have a few concerns that I would be happy for you to address.
Did you keep the Othmanic font? Some of the words are missing the "Maddah", which is a character used instead of the Alef. For example:
الكوثر الآية 1 أَعْطَيْنَٰكَ
هود الآية 48 يَٰنُوحُ
These are all written without the maddah; even with the diacritics present, this could still lead to different meanings.
The other thing: have you tried chopping affixes such as (و) and (يا)? I can see they are described perfectly in the morphology, but I guess a light stemmer could help.
Finally, the presence of diacritics resolves a large part of the ambiguity, but unfortunately most of the text available online is written without them. I know your focus is mainly on Quranic Arabic, but are you planning to provide the same annotated Quranic corpus without diacritics, so that it could be helpful for other Arabic texts?
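To make the two suggestions concrete, here is a minimal sketch in Python of what I have in mind. It is only an illustration, not anything taken from the corpus tools: the function names, the prefix list, and the minimum stem length are my own assumptions. It also assumes that a "dagger alef" (U+0670) in Othmanic spelling can be expanded to a plain Alef before the diacritics are dropped, which is a simplification that will not hold for every word.

```python
import unicodedata

DAGGER_ALEF = "\u0670"  # superscript alef found in Othmanic spelling
ALEF = "\u0627"         # plain Alef

# Hypothetical prefix list for illustration only: (يا) and (و).
PREFIXES = ("\u064A\u0627", "\u0648")


def strip_diacritics(text, expand_dagger_alef=False):
    """Drop combining marks (Unicode category Mn), which covers the Arabic diacritics.

    If expand_dagger_alef is True, each dagger alef is first rewritten as a
    plain Alef (an assumed, simplified normalization) so the letter is kept.
    """
    if expand_dagger_alef:
        text = text.replace(DAGGER_ALEF, ALEF)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def light_stem(word, min_stem_len=3):
    """Remove at most one listed prefix, keeping at least min_stem_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem_len:
            return word[len(prefix):]
    return word


if __name__ == "__main__":
    # The word from Hud 48, written with escapes to keep the code points explicit.
    word = "\u064A\u064E\u0670\u0646\u064F\u0648\u062D\u064F"
    plain = strip_diacritics(word, expand_dagger_alef=True)
    print(plain, light_stem(plain))
```

Note that if the dagger alef is simply dropped along with the other diacritics, the (يا) prefix can no longer be recognised by a prefix match, which is part of why the spelling question above matters.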
Good luck.
Best wishes,
Mahmoud
-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Kais Dukes
Sent: Thursday, November 12, 2009 3:23 PM
To: corpora at uib.no
Subject: [Corpora-List] Quranic Arabic Corpus – Version 0.1 Released
Quranic Arabic Corpus – Version 0.1 Released
Hello All,
For those interested in Arabic part-of-speech tagging and syntactic analysis, a new resource has now been made available as a free open source download:
http://quran.uk.net
You can now obtain version 0.1 of the data which includes:
(1) A plain text file showing each word in every verse of the Quran, together with its (contextual) part-of-speech tag.
(2) The same data in XML format encoded as UTF-8
(3) A more detailed XML file with full morphological (inflection+derivation) feature tags
We plan to produce incremental updates until we reach version 1.0: cross-annotator verification for full morphology and syntax of the Quran using dependency grammar.
The Quranic Arabic Corpus is an annotated linguistic resource consisting of 77,430 words of Quranic Arabic. The research project is led by Kais Dukes at the University of Leeds, and is part of the Arabic language computing research group within the School of Computing, supervised by Eric Atwell. The project aims to provide a richly annotated linguistic resource for researchers wanting to study the Arabic language of the Quran. The grammatical analysis helps readers uncover the detailed intended meaning of each verse and sentence. Each word of the Quran is tagged with its part-of-speech as well as multiple morphological features. Unlike other annotated Arabic corpora, the grammar framework adopted by the Quranic Corpus is the traditional Arabic grammar of i'rab.
The research project includes:
- A manually verified part-of-speech tagged Quranic Arabic corpus.
- An annotated treebank of Quranic Arabic.
- A novel visualization of traditional Arabic grammar through dependency graphs.
- Morphological search for the Quran.
- A machine-readable morphological lexicon of Quranic words into English.
- A part-of-speech concordance for Quranic Arabic organized by lemma.
- An online message board for community volunteer annotation.
The annotation for each of the 77,430 words in the Quran has been reviewed in stages by two annotators, and work is ongoing to further improve accuracy.
Any feedback on the project is most welcome.
Kind Regards,
-- Kais Dukes
School of Computing
University of Leeds
web: http://quran.uk.net
e-mail: sckd at leeds.ac.uk
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora