Arabic-L:LING:Morph + POS tagged Quran
Dilworth Parkinson
dil at BYU.EDU
Tue Mar 31 22:34:24 UTC 2009
------------------------------------------------------------------------
Arabic-L: Tue 31 Mar 2009
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject:Morph + POS tagged Quran
-------------------------Messages-----------------------------------
1)
Date: 31 Mar 2009
From:kais.dukes at jqurantree.org
Subject:Morph + POS tagged Quran
[Kais sent this note to me (the moderator), but agreed that I could
post it to the group for your information--dil]
Hi Dil,
If you recall we spoke some weeks ago about a POS tagged Quran. I
currently have some exciting results I would like to share with you,
in the hope of getting your opinion. I have ported Tim Buckwalter's
BAMA analyzer to Java, and integrated it into the JQuranTree API. I
then ran the analyzer against the Quranic text. I found a problem in
that BAMA produces many possible results for each token, usually
around 5 but in extreme cases up to 26. However, I was able to find a
way to rank these results using a scoring function (described below).
The results are a partially accurate POS + morph tagged Quran. I have
put up a web interface so that the tagged Quran can be browsed online:
http://jqurantree.org/morphology/
I would really appreciate some feedback on this. I know it is still a work
in progress, but I am encouraged by the results so far, as can be seen on
the web page.
The current analyzer (BAMA + scoring function) seems to work better on
some of the shorter suras (i.e. chapter 80 onwards) although this
could just be my impression. The scoring function assigns an integer
(+ve, zero or -ve) to each candidate BAMA solution for each token. The
BAMA result with the highest score is then chosen as the unique morph
analysis for that token in the Quran. Sometimes BAMA suggests
alternative spellings when the original spelling is not found, thus
the scoring function is:
Step 1. For each letter in the candidate BAMA result, if the letter matches
the letter of the original word at the same position, then +10, else -10.
Step 2. If the letter matches, then for each diacritic on the BAMA result's
letter, if that diacritic is present in the original word then +1, else
-1.
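A minimal sketch of the two scoring steps, written in Python for brevity rather than in the Java of the actual port. It assumes words are Unicode Arabic strings in which diacritics are combining marks, that a diacritic counts as "present" only if it sits on the same letter of the original word, and that a candidate letter with no counterpart in the original is a mismatch. The names split_letters, score and best_analysis are illustrative and are not part of BAMA or JQuranTree.

```python
import unicodedata


def split_letters(word):
    """Split a word into [base letter, set of attached diacritics] units."""
    units = []
    for ch in word:
        if unicodedata.combining(ch) and units:
            units[-1][1].add(ch)      # combining mark: attach to previous letter
        else:
            units.append([ch, set()])  # base letter starts a new unit
    return units


def score(candidate, original):
    """Score a candidate spelling against the original token (assumed rules)."""
    cand = split_letters(candidate)
    orig = split_letters(original)
    total = 0
    for i, (letter, marks) in enumerate(cand):
        if i < len(orig) and orig[i][0] == letter:
            total += 10                       # Step 1: letter matches
            for mark in marks:                # Step 2: compare diacritics
                total += 1 if mark in orig[i][1] else -1
        else:
            total -= 10                       # Step 1: letter differs or is missing
    return total


def best_analysis(candidates, original):
    """Pick the candidate with the highest score (first one wins on ties)."""
    return max(candidates, key=lambda c: score(c, original))


original = "\u0643\u064e\u062a\u064e\u0628\u064e"  # kataba, fully diacritized
kutiba = "\u0643\u064f\u062a\u0650\u0628\u064e"    # same letters, two different vowels
kitab = "\u0643\u062a\u0627\u0628"                 # different spelling, no diacritics

print(score(original, original))  # 33
print(score(kutiba, original))    # 29
print(score(kitab, original))     # 0
```

Because a letter match is worth ten times a diacritic match, a candidate whose letters agree with the text always outranks one that only agrees on vowels, which fits the remark that BAMA sometimes proposes alternative spellings.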
It would be great if you could have a quick look at the data. I am now
thinking about what to do next. My aim is to push the accuracy of the
POS/Morph tagger as high as possible. Some ideas come to mind:
1) I could select N tokens (N = 100, N = 1000?) and manually go
through them to give the current analyzer a % accuracy score (or
F-measure, i.e. the harmonic mean of precision and recall?).
2) Another idea is to make a list of missing words and their frequencies.
3) Some other work I could do ... make a list of the POS tags coming
out of BAMA. I am not sure what all of these mean, or how many there
are, although I skimmed through Tim Buckwalter's documentation and
that looked quite comprehensive. Perhaps I should map the POS tag set
to something more standard or well known?
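Ideas 1) and 2) could be prototyped roughly as follows. This is only a sketch in Python: the function names are hypothetical, the manual judgments are assumed to be recorded as one True/False value per sampled token, and analyze stands in for whatever call returns the list of BAMA candidates for a token (an empty list meaning the word is missing).

```python
import random
from collections import Counter


def sample_tokens(tokens, n, seed=0):
    """Draw up to n tokens at random for manual checking (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(tokens, min(n, len(tokens)))


def accuracy(judgments):
    """Fraction of sampled tokens whose top-ranked analysis was judged correct."""
    return sum(judgments) / len(judgments) if judgments else 0.0


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def missing_word_frequencies(tokens, analyze):
    """Count tokens for which the (hypothetical) analyze callable returns nothing."""
    counts = Counter(t for t in tokens if not analyze(t))
    return counts.most_common()  # most frequent missing words first


print(accuracy([True, True, False, True]))  # 0.75
print(f_measure(0.5, 0.5))                  # 0.5
```

Sorting the missing words by frequency would show which gaps in the lexicon cost the most tokens, so the most common ones can be handled first.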
Looking forward to your replies.
Kind Regards,
-- Kais Dukes
--------------------------------------------------------------------------
End of Arabic-L: 31 Mar 2009