Arabic-L:LING:New version of Quran Morphology Website
Dilworth Parkinson
dil at BYU.EDU
Wed Sep 16 19:07:44 UTC 2009
------------------------------------------------------------------------
Arabic-L: Wed 16 Sep 2009
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject:New version of Quran Morphology Website
-------------------------Messages-----------------------------------
1)
Date: 16 Sep 2009
From:dukes.kais at googlemail.com
Subject:New version of Quran Morphology Website
Hello,
Apologies for the mass email. Hopefully an e-mail list for this
project will soon be set up, so that you can subscribe or
unsubscribe depending on whether you remain interested in this project.
I have uploaded a new version of the website http://quran.uk.net.
There has been a lot of good feedback about the work being done, and I
have tried to respond to this by adding new features to the
morphological annotated corpus of the Quran. I am now continuing this
research under the supervision of Eric Atwell at the University of
Leeds, so the website now includes my School of Computing email as a
contact address. On busy days we get a few hundred visitors to the
website, and this has been growing over time.
(1) Corrections to the morphological and syntactic annotation of the
Quran. Over the last few months, corrections were suggested to nearly
1000 words (there are 77,430 words in the Quran). As a result of these
suggestions, accuracy is now much improved on key passages of the
text. I have gone through and reviewed each of these suggestions by
comparing against traditional sources, and I have approved about half
of them. The other half were mostly subtle tagging issues on which it
may be hard to find agreement, for example, adjectives as predicates
versus nouns, and nouns versus proper nouns. In classical Arabic these
distinctions do not usually change meaning, nor are the differences
critical for further syntactic analysis of the text. I plan to improve
the annotator guidelines to cover these cases. For now, I have
enforced consistency by reviewing each change made to the tagged corpus.
(2) Arabic terminology on the website. This is the biggest change to
the site. Hopefully including Arabic terms for the morphological
analysis will attract a bigger audience. I am generating the Arabic
analysis automatically from the morphological features. So if these are
wrong in any way, it would be great if annotators could let me know
before I make a public announcement about the new version of the
website via the mailing lists.
(3) Improvements to the segmentation scheme. Previously only attached
object pronouns were segmented (maf3ul bihi). Now subject pronouns
(fa3il) are also segmented, and the morphemes are shown in blue. This
is to keep the analysis more in line with traditional Arabic grammar.
This segmentation has been performed automatically according to
traditional inflection rules, so I believe this should be quite
accurate. Annotators are welcome to review this new change.
(4) Audio recitation of the Quran. Some volunteers requested that a
feature be added whereby they can listen to each verse being recited
by an authentic source. My guess is that this is useful because
the tone of voice makes disambiguation easier. For each verse,
you can now click the play button (at the bottom of the page) and hear
the verse in Arabic. Please allow time to load for slow connections.
(5) A link to the new treebank project. This has been included on the
main page. The idea here is that we want to attract more volunteers to
help with the syntactic analysis of the Quran.
(6) The root list has been reviewed and improved. We now have an
accurate root for each word in the Quran. The Buckwalter analyzer used
to provide the initial tagging did not provide roots, only stems.
However, I have managed to get hold of a more accurate root list than
before.
(7) Changes to the part-of-speech tags. The adverb tag (ADV) has been
removed. Instead, there are two new tags (LOC and T) for location and
time adverbs. This is to keep the tagging more aligned with
traditional Arabic grammar. In Arabic, these are the tags for Dharf Makan
and Dharf Zaman. If used as adverbs, these words will always be in the
accusative case.
(8) Changes to preposition tagging. The preposition tag (P) is now
only used for Harf Jar (genitive prepositions), so that the P tag now
agrees 100% with traditional Arabic grammar. Words previously tagged
as P are now either nouns (N) or time/location adverbs (T/LOC)
depending on context. The idea behind this change was that traditional
Arabic grammar defines a set of prepositions (harf jar), which we were
not previously using, so we used to mistake T/LOC adverbs
(Dharf Makan/Zaman) for prepositions.
(9) Proper nouns. The list of proper nouns has been extended. For
example, Satan and Quran are now considered to be proper nouns.
(10) Entries in the dictionary are now sorted by verb form. The
lexicon page shows words in the Quran grouped by root. The words are
now subdivided according to form I, form II, form III, etc.
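As a rough illustration of the retagging described in (7) and (8), here
is a minimal Python sketch. It is not the code used on the website. The
lemma sets are small, illustrative assumptions written in unvocalized
Buckwalter transliteration, and the function name retag is hypothetical.
The real change depends on context, which this lemma lookup ignores.

```python
# Illustrative, partial lemma sets (assumptions for this sketch only,
# not the project's actual lists).
HARF_JAR = {"mn", "<lY", "En", "ElY", "fy", "b", "l", "k"}
TIME_ADVERBS = {"qbl", "bEd"}       # "before", "after"
LOCATION_ADVERBS = {"fwq", "tHt"}   # "above", "below"


def retag(lemma: str, old_tag: str) -> str:
    """Map an old P or ADV tag to the revised tag set.

    Tags other than P and ADV are returned unchanged.
    """
    if old_tag not in {"P", "ADV"}:
        return old_tag
    # P is kept only for members of the closed harf jar set.
    if old_tag == "P" and lemma in HARF_JAR:
        return "P"
    if lemma in TIME_ADVERBS:
        return "T"
    if lemma in LOCATION_ADVERBS:
        return "LOC"
    # Former P words outside all sets become nouns; an unresolved ADV
    # keeps its old tag here so it can be picked up for manual review.
    return "N" if old_tag == "P" else "ADV"
```

For example, retag("qbl", "P") gives "T", while retag("fy", "P") stays
"P". A real implementation would need the surrounding syntax to decide
between N and T/LOC.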
Outstanding work...
Here is a list of other good suggestions that never made it into this
release. Hopefully they will be included in the next version.
(1) Changes to feminine/masculine. There have been quite a few
suggestions that we change the gender of various words. This
needs to be reviewed.
(2) We should show root counts in the dictionary. This will help with
manual verification against published root lists of the Quran.
(3) We should consider showing the pattern of each word, as well as
the root. The Buckwalter analyzer used to produce the initial tagging
didn’t give us patterns. However, now that we have accurate roots and
words, it may be possible to derive the patterns automatically. One
idea would be to use regular expressions.
(4) Linguistic search tool. Similar to the search tool for the British
National Corpus, we should be able to search by word, part-of-speech
tag and proximity, e.g. 20 words away from another word.
(5) Translations. We should include multiple English translations and
allow searches over them.
(6) A testimonials page, listing the positive feedback to the project.
Might encourage other interested volunteers to join.
(7) For each word, give links to existing Arabic lexicons showing the
analysis of the word. Might speed up annotation and corrections.
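To make the regular expression idea in (3) concrete, here is a minimal
Python sketch of one possible approach. It is not an implemented
feature. It assumes words and roots are given in Buckwalter
transliteration, handles triliteral roots only, and the function name
derive_pattern is hypothetical. The three root letters are located in
order and replaced by the conventional placeholders f, E and l.

```python
import re
from typing import Optional


def derive_pattern(word: str, root: str) -> Optional[str]:
    """Return the pattern of `word` for a triliteral `root`, or None.

    None is returned when the root is not triliteral or its letters do
    not all appear in order in the word (e.g. weak roots whose letter
    has changed on the surface).
    """
    if len(root) != 3:
        return None
    # Escape each letter: Buckwalter uses characters such as $ and *
    # that are special in regular expressions.
    c1, c2, c3 = (re.escape(c) for c in root)
    # Non-greedy groups pick the first occurrence of each root letter.
    m = re.fullmatch(f"(.*?){c1}(.*?){c2}(.*?){c3}(.*)", word)
    if m is None:
        return None
    pre, mid1, mid2, post = m.groups()
    return f"{pre}f{mid1}E{mid2}l{post}"
```

For example, derive_pattern("kAtib", "ktb") gives "fAEil" and
derive_pattern("maktuwb", "ktb") gives "mafEuwl". Known limitations of
this sketch: it can mis-assign letters when an affix contains a root
letter, and it returns None for weak roots such as "qwl" in "qAla", so
results would still need manual review.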
Any feedback is welcome! If you are interested in volunteering for the
morphology or for the treebank projects do let me know.
Kind Regards,
-- Kais Dukes
School of Computing
University of Leeds
--------------------------------------------------------------------------
End of Arabic-L: 16 Sep 2009