[Corpora-List] Quranic Arabic Corpus Version 0.2
Kais Dukes
sckd at leeds.ac.uk
Mon Feb 1 10:11:06 UTC 2010
Hello,
Apologies if you have received this e-mail more than once.
== Quranic Arabic Corpus Version 0.2 ==
Version 0.2 released today - Monday 1st February, 2010. The Quranic
Arabic Corpus is an annotated linguistic resource which shows the
Arabic grammar, syntax and morphology for each word in the Quran. The
corpus provides three levels of analysis: morphological annotation, a
syntactic treebank and a semantic ontology. The research project is
organized at the University of Leeds, and is part of the Arabic
language computing research group within the School of Computing,
supervised by Eric Atwell.
This project aims to provide a richly annotated linguistic resource
for researchers wanting to study the original Arabic language of the
Quran. Each day on average, the website receives 10,000 page views and
over 1,500 visitors from 135 different countries world-wide. Following
user feedback, a new version of the corpus is now available with
several improvements to both the website and the annotated linguistic
data:
http://corpus.quran.com
== Synopsis of New Features ==
Linguistics:
- Syntactic treebank now includes chapter 2 of the Quran
- Visual ontology with 300 concepts and 350 logical relations
- Named entity tagging, with 6,000 Arabic words in the Quran identified
- Higher accuracy for part-of-speech tagging and morphological analysis
Data download:
- New parts-of-speech for particles (PRO/prohibition, SUP/supplemental)
- Improved English terminology for corresponding Arabic grammar terms
- Fixed typos in interlinear translation
- Fixed missing last verses in data download files
Website:
- Easier and quicker navigation with direct verse selection
- Search page now shows entire verses in Arabic and English
- Improved message board security with user sign-in and registration
== Linguistic Improvements ==
- The syntactic treebank uses dependency graphs to visualize the
parsed syntactic structure for Arabic verses in the Quran. Previously,
the treebank covered approximately 5,000 words (surat l-fatihah and
the last two juz of the Quran). In version 0.2, the treebank has been
extended to include chapter 2 (surat l-baqarah) and now covers over
11,000 Arabic words in the Quran with 2,500 dependency graphs. See:
http://corpus.quran.com/treebank.jsp
- The ontology of Quranic concepts is the largest new feature to be
added in this release. This shows a visual map of the names of people,
places and other entities mentioned in the Quran
(http://corpus.quran.com/ontology.jsp). Relationships between entities
are encoded using predicate logic (e.g. father/son, instance/subclass,
part-of, etc). At present, this is a basic ontology to enable a
further planned step of analysis, pronoun resolution. A brief webpage
has been written about each of the 300 concepts in the ontology,
providing a short synopsis, as well as showing predicate logic
relations. Users can add comments to each ontology concept page. It is
hoped that over time the ontology will grow into a small specialized
wiki of Quranic topics, formalized using machine-readable predicate
logic. Each page in the ontology is hyperlinked to the closest
corresponding page in Wikipedia, where applicable. A topic concordance
of concepts is also available (http://corpus.quran.com/topics.jsp)
which allows users to click through to easily find verse references
for each concept in the ontology.
- Named entity tagging in the Quranic corpus involves identifying
specific Arabic words (or spans of words) in verses, and mapping these
to well-defined formal concepts in the ontology. The word-by-word
grammatical annotation scheme on the website has been extended to show
links to the ontology. So far, 6,000 Arabic words have been tagged as
named entities and have been mapped to concepts. These include all
proper nouns in the corpus, as well as names of other specific
locations, places, animals and important events mentioned in the
Quran.
- A detailed linguistic review has been completed of all messages on
the message board. This has left 339 messages open for further
discussion, with 2,842 messages now resolved and archived. Version 0.2
of the corpus incorporates many improvements and suggestions from
volunteer annotators on how grammatical tagging might be improved.
This has resulted in much higher accuracy in the online grammatical
analysis for each Arabic word.
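
As a rough illustration of how predicate-logic relations such as
instance/subclass could be held in machine-readable form, here is a
minimal Python sketch. The relation triples, concept names and function
names are assumptions made for this example; they are not taken from
the corpus data files or from the website's implementation.

```python
from collections import defaultdict

# Illustrative (predicate, subject, object) triples. These are made up
# for the sketch and are NOT copied from the corpus ontology.
RELATIONS = [
    ("subclass", "City", "Location"),
    ("instance", "Mecca", "City"),
    ("subclass", "Prophet", "Person"),
    ("instance", "Moses", "Prophet"),
]


def superclasses(concept, relations):
    """Return every concept reachable from `concept` via subclass links."""
    parents = defaultdict(set)
    for predicate, subject, obj in relations:
        if predicate == "subclass":
            parents[subject].add(obj)
    seen = set()
    stack = [concept]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_a(entity, concept, relations):
    """True if `entity` is an instance of `concept` or of one of its subclasses."""
    for predicate, subject, obj in relations:
        if predicate == "instance" and subject == entity:
            if obj == concept or concept in superclasses(obj, relations):
                return True
    return False
```

With these example triples, is_a("Mecca", "Location", RELATIONS) is
True because City is declared a subclass of Location, while
is_a("Mecca", "Person", RELATIONS) is False.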
== Data Download Improvements ==
- Previously for part-of-speech tagging, the SUP tag was used for the
rare surprise particle. This has now been changed to SUR/surprise.
Version 0.2 of the corpus introduces two new part-of-speech tags for
particles, in order to achieve higher accuracy with regard to
traditional Arabic grammatical analysis (i'rab): SUP/supplemental
(harf za'id) and PRO/prohibition. The latter is required to correctly
distinguish negative particles (NEG = harf nafee) from particles of
prohibition (PRO = harf nahee). Proper noun tagging has also been improved.
Completion of the initial draft of the ontology has allowed for a
clearer view on what should be tagged as a proper noun, based on
grammatical as well as semantic considerations.
- English terminology on the website has been improved for
corresponding Arabic grammatical terms. The syntactic treebank now
uses clearer English terminology and phrase tagging for jumlah fi'liya
/ ismiyah (VS / NS = verbal / nominal sentence). Previously these were
named "verb phrase" and "noun phrase" which may have led to some
confusion. There is also improved terminology for the rarer Quranic
verbal nouns, e.g. "imperative verbal noun" instead of just
"imperative noun" for "ism fi'il amr".
- Some typos have been fixed in the interlinear English translation.
This includes correcting some of the places where words have been
doubled up, as well as fixing missing occurrences of the word "zakah".
There are quite likely to be more improvements to be made in the
interlinear translation with regard to accuracy against traditionally
accepted sources of translation into English. Comments are more than
welcome via the message board.
- The data download files for version 0.2 of the corpus have been
updated to include all these new improvements. The issue of missing
last verses when downloading data has now been fixed.
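
Because the SUP tag changes meaning between versions, scripts written
against the earlier data may need a small adjustment. The following
Python sketch shows one way to handle the rename. The tag glosses come
from the notes above, but the dictionary layout and the function name
are assumptions for illustration and do not describe the format of the
download files. Note that only the SUP to SUR rename is mechanical; the
new SUP and PRO tags reflect fresh annotation and cannot be derived
from old tags.

```python
# Particle tags as described for version 0.2. The glosses follow the
# release notes; storing them in a dict is an assumption of this sketch.
V02_PARTICLE_TAGS = {
    "NEG": "negative particle (harf nafee)",
    "PRO": "particle of prohibition (harf nahee)",
    "SUP": "supplemental particle (harf za'id)",
    "SUR": "surprise particle",
}


def upgrade_tag(old_tag):
    """Map a pre-0.2 tag name to its version 0.2 name.

    Hypothetical helper: before 0.2, SUP marked the surprise particle,
    which is now tagged SUR. All other tag names pass through unchanged.
    """
    return "SUR" if old_tag == "SUP" else old_tag
```

For example, upgrade_tag("SUP") gives "SUR", and upgrade_tag("NEG")
leaves the tag as "NEG".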
== Website Improvements ==
- A drop down verse list has been introduced across the website. This
allows for easier and quicker navigation with direct verse selection.
This was a feature often requested by regular website users.
- The search page now shows entire verses in Arabic and English. When
searching for a word or using the concordance functionality,
previously only a list of matching words would be displayed. Now, each
search result highlights the matching Arabic word and shows it in its
entire verse, in context. A corresponding English translation for each
verse is also displayed when searching, using the Sahih International
translation. Website users also have the option of using 8 different
English translations for wider context, including the word-by-word
interlinear translation.
- The message board now has improved security with user sign-in and
registration. The Quranic Arabic Corpus website receives many regular
visitors, including young students who use the website to learn about
Arabic grammar and to find out more about the Quran. This registration
process is intended to protect our users from spam, and to prevent
other unsuitable or potentially harmful messages from being posted to
the message board. Users can now also post messages to each of the 300
ontology concept pages, so that hopefully this new content can be
improved and extended over time.
- Non-technical interview with the Muslim Post (January 2010) -
http://corpus.quran.com/interview.jsp
- Linguistic academic paper (for submission) - "Kais Dukes and Tim
Buckwalter. A Dependency Treebank of the Quran using Traditional
Arabic Grammar." - http://corpus.quran.com/publications.jsp
== Feedback ==
Any feedback on version 0.2 of the Quranic Arabic Corpus is more than
welcome. The Quranic Arabic Corpus is made freely available under the
GNU public license and the corpus terms-of-use.
Kind Regards,
-- Kais Dukes
Language Research Group
School of Computing
University of Leeds
http://corpus.quran.com - The Quranic Arabic Corpus
comp-quran at comp.leeds.ac.uk - Computational Quranic Arabic discussion list
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora