Arabic-L:LING:New version of Quran Morphology Website

Dilworth Parkinson dil at BYU.EDU
Wed Sep 16 19:07:44 UTC 2009


------------------------------------------------------------------------
Arabic-L: Wed 16 Sep 2009
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
            unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject:New version of Quran Morphology Website

-------------------------Messages-----------------------------------
1)
Date: 16 Sep 2009
From:dukes.kais at googlemail.com
Subject:New version of Quran Morphology Website

Hello,

Apologies for the mass email. Hopefully an e-mail list for this
project will soon be set up, which will allow you to subscribe or
unsubscribe depending on whether you remain interested in this project.

I have uploaded a new version of the website http://quran.uk.net.  
There has been a lot of good feedback about the work being done, and I  
have tried to respond to this by adding new features to the  
morphological annotated corpus of the Quran. I am now continuing this  
research under the supervision of Eric Atwell at the University of  
Leeds, so the website now includes my School of Computing email as a  
contact address. On busy days we get a few hundred visitors to the  
website, and this has been growing over time.

(1) Corrections to the morphological and syntactic annotation of the  
Quran. Over the last few months, corrections were suggested to nearly  
1000 words (there are 77,430 words in the Quran). As a result of these  
suggestions, accuracy is now much improved on key passages of the  
text. I have gone through and reviewed each of these suggestions by  
comparing against traditional sources, and I have approved about half  
of them. The other half were mostly subtle tagging issues that it may  
be hard to find agreement on, for example adjectives as predicates
versus nouns, and nouns versus proper nouns. In classical Arabic these
distinctions do not usually change meaning, nor are the differences  
critical for further syntactic analysis of the text. I plan to improve  
the annotator guidelines to cover these cases. For now, I have  
enforced consistency by reviewing each change made to the tagged corpus.

(2) Arabic terminology on the website. This is the biggest change to  
the site. Hopefully including Arabic terms for the morphological  
analysis will attract a bigger audience. I am generating the Arabic  
analysis automatically from the morphological features, so if any of it
is wrong, it would be great if annotators could let me know
before I make a public announcement about the new version of the  
website via the mailing lists.

(3) Improvements to the segmentation scheme. Previously only attached  
object pronouns were segmented (maf3ul bihi). Now subject pronouns  
(fa3il) are also segmented, and the morphemes are shown in blue. This  
is to keep the analysis more in line with traditional Arabic grammar.  
This segmentation has been performed automatically according to  
traditional inflection rules, so I believe this should be quite
accurate. Annotators are welcome to review this new change.

(4) Audio recitation of the Quran. Some volunteers requested that a  
feature be added whereby they can listen to each verse being recited  
by an authentic source. My guess for why this is useful is perhaps  
that the tone of voice makes disambiguation easier. For each verse,  
you can now click the play button (at the bottom of the page) and hear  
the verse in Arabic. On slow connections, please allow time for it to load.

(5) A link to the new treebank project. This has been included on the  
main page. The idea here is that we want to attract more volunteers to  
help with the syntactic analysis of the Quran.

(6) The root list has been reviewed and improved. We now have an  
accurate root for each word in the Quran. The Buckwalter analyzer used  
to provide the initial tagging did not provide roots, only stems.  
However, I have managed to get hold of a more accurate root list than  
before.

(7) Changes to the part-of-speech tags. The adverb tag (ADV) has been  
removed. Instead, there are two new tags (LOC and T) for location and  
time adverbs. This is to keep the tagging more aligned with  
traditional Arabic grammar. In Arabic, these are the tags for Dharf Makan
and Dharf Zaman. If used as adverbs, these words will always be in the
accusative case.

(8) Changes to preposition tagging. The preposition tag (P) is now  
only used for Harf Jar (genitive prepositions), so that the P tag now  
agrees 100% with traditional Arabic grammar. Words previously tagged  
as P are now either nouns (N) or time/location adverbs (T/LOC)  
depending on context. The idea behind this change was that traditional
Arabic grammar defines a set of prepositions (Harf Jar) and we were
not previously using this list, so we used to confuse T/LOC adverbs
(Dharf Makan/Zaman) with prepositions.

(9) Proper nouns. The list of proper nouns has been extended. For  
example, Satan and Quran are now considered to be proper nouns.

(10) Entries in the dictionary are now sorted by verb form. The  
lexicon page shows words in the Quran grouped by root. The words are  
now subdivided according to form I, form II, form III, etc.
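
As a rough illustration of the grouping described in (10), here is a
minimal Python sketch. The entries, the Buckwalter-style transliteration
and the function name are made-up sample material for this email, not
the site's actual data model or code.

```python
from collections import defaultdict

# Hypothetical sample entries: (root, verb form, lemma), in a simplified
# Buckwalter-style transliteration. Illustrative only.
ENTRIES = [
    ("ktb", "I", "kataba"),
    ("Elm", "II", "Eal~ama"),
    ("Elm", "I", "Ealima"),
    ("ktb", "VIII", "Aikotataba"),
]

# Roman numerals for the verb forms, in their conventional order.
FORM_ORDER = ["I", "II", "III", "IV", "V", "VI",
              "VII", "VIII", "IX", "X", "XI", "XII"]


def group_by_root_and_form(entries):
    """Return {root: [(form, [lemmas]), ...]} with forms in numeric order."""
    grouped = defaultdict(lambda: defaultdict(list))
    for root, form, lemma in entries:
        grouped[root][form].append(lemma)
    return {
        root: sorted(forms.items(), key=lambda item: FORM_ORDER.index(item[0]))
        for root, forms in grouped.items()
    }


lexicon = group_by_root_and_form(ENTRIES)
for root, forms in lexicon.items():
    for form, lemmas in forms:
        print(root, "form", form, lemmas)
```

Sorting on the position in FORM_ORDER rather than on the numeral string
matters, because a plain string sort would put "IX" before "V".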

Outstanding work...

Here is a list of other good suggestions that never made it into this
release. Hopefully they will be included in the next version.

(1) Changes to feminine/masculine. There have been quite a few
suggestions that we change the gender of various words. This
needs to be reviewed.

(2) We should show root counts in the dictionary. This will help with  
manual verification against published root lists of the Quran.

(3) We should consider showing the pattern of each word, as well as  
the root. The Buckwalter analyzer used to produce the initial tagging
didn’t give us patterns. However, now that we have accurate roots and  
words, it may be possible to derive the patterns automatically. One  
idea would be to use regular expressions.

(4) Linguistic search tool. Similar to the search tool for the British  
National Corpus, we should be able to search by word, part-of-speech  
tag and proximity, e.g. 20 words away from another word.

(5) Translations. We should include multiple English translations and  
allow searches over them.

(6) A testimonials page, listing the positive feedback to the project.  
Might encourage other interested volunteers to join.

(7) For each word, give links to existing Arabic lexicons showing the  
analysis of the word. Might speed up annotation and corrections.
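
To make the regular expression idea in (3) a little more concrete, here
is a minimal Python sketch. It assumes words and roots are available in
a Buckwalter-style transliteration and uses the letters f, E and l as
the pattern slots. The function name and the sample words are
hypothetical and only for illustration.

```python
import re


def derive_pattern(word, root, slots="fEl"):
    """Replace the root consonants in `word`, in order, with slot letters.

    Returns None when the root letters cannot be found in sequence,
    e.g. for weak roots whose letters change or drop out.
    """
    if len(root) != len(slots):
        return None
    # For root "ktb" this builds ^(.*?)k(.*?)t(.*?)b(.*)$ .
    # re.escape is needed because the transliteration uses characters
    # such as * and $ that are special in regular expressions.
    regex = "^" + "".join("(.*?)" + re.escape(c) for c in root) + "(.*)$"
    match = re.match(regex, word)
    if match is None:
        return None
    parts = match.groups()
    # Interleave the non-root material with the slot letters.
    return "".join(p + s for p, s in zip(parts, slots)) + parts[-1]


print(derive_pattern("kataba", "ktb"))    # faEala
print(derive_pattern("makotuwb", "ktb"))  # mafoEuwl
print(derive_pattern("qaAla", "qwl"))     # None (weak root)
```

This is only a first approximation. The lazy match takes the first
occurrence of each root letter, so a prefix or infix letter that equals
a root letter can be picked up by mistake, and weak roots would need
extra rules. Cases that come back as None could be left for manual
review.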

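The proximity search in (4) could start from something as simple as the
Python sketch below. It assumes the corpus is available as a flat list
of (word, tag) pairs. The sample tokens, the CONJ tag and the function
name are assumptions made for this example, not the project's real data
or API.

```python
# Hypothetical tagged tokens as (word, tag) pairs; sample data only.
TOKENS = [
    ("fiy", "P"),
    ("Al>aroDi", "N"),
    ("wa", "CONJ"),
    ("Als~amaA'i", "N"),
]


def find_near(tokens, first, second, max_distance):
    """Yield index pairs (i, j) where tokens[i] matches `first`,
    tokens[j] matches `second`, and they are at most `max_distance`
    tokens apart.

    Each query is a (word, tag) pair; None in either slot matches anything.
    """
    def matches(token, query):
        return all(q is None or q == t for t, q in zip(token, query))

    for i, tok in enumerate(tokens):
        if not matches(tok, first):
            continue
        lo = max(0, i - max_distance)
        hi = min(len(tokens), i + max_distance + 1)
        for j in range(lo, hi):
            if j != i and matches(tokens[j], second):
                yield (i, j)


# Nouns within 3 tokens of the word "fiy".
print(list(find_near(TOKENS, ("fiy", None), (None, "N"), 3)))
```

A linear scan like this is fine for a corpus of this size, although a
real search tool would probably want an index and verse boundaries.
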
Any feedback is welcome! If you are interested in volunteering for the  
morphology or for the treebank projects do let me know.

Kind Regards,

-- Kais Dukes
School of Computing
University of Leeds


--------------------------------------------------------------------------
End of Arabic-L:  16 Sep 2009