Arabic-L:LING:Morph + POS tagged Quran

Tue Mar 31 22:34:24 UTC 2009

------------------------------------------------------------------------
Arabic-L: Tue 31 Mar 2009
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
            unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject:Morph + POS tagged Quran

-------------------------Messages-----------------------------------
1)
Date: 31 Mar 2009
From:kais.dukes at jqurantree.org
Subject:Morph + POS tagged Quran

[Kais sent this note to me (the moderator), but agreed that I could  
post it to the group for your information--dil]

Hi Dil,

If you recall we spoke some weeks ago about a POS tagged Quran. I  
currently have some exciting results I would like to share with you,  
in the hope of getting your opinion. I have ported Tim Buckwalter's  
BAMA analyzer to Java, and integrated it into the JQuranTree API. I  
then ran the analyzer against the Quranic text. I found a problem in  
that BAMA produces many possible results for each token, usually  
around 5 but in extreme cases up to 26. However, I was able to find a  
way to rank these results using a scoring function (described below).

The result is a partially accurate POS + morph tagged Quran. I have
put up a web interface so that the tagged Quran can be browsed online:

http://jqurantree.org/morphology/

I would really appreciate some feedback on this. I know this is still a
work in progress, but I am encouraged by the results so far, as can be
seen on the web page.

The current analyzer (BAMA + scoring function) seems to work better on  
some of the shorter suras (i.e. chapter 80 onwards) although this  
could just be my impression. The scoring function assigns an integer  
(+ve, zero or -ve) to each candidate BAMA solution for each token. The  
BAMA result with the highest score is then chosen as the unique morph  
analysis for that token in the Quran. Because BAMA sometimes suggests
alternative spellings when the original spelling is not found, the
scoring function compares each candidate with the original word:

Step 1. For each letter in the candidate BAMA result, if the letter
matches the letter in the original word at the same position, then +10,
else -10.

Step 2. If the letter matches, then for each diacritic on that letter in
the BAMA result, if that diacritic is present in the original word then
+1, else -1.
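
The two steps above can be sketched roughly as follows. This is only an
illustration in Python, not the Java code in JQuranTree. It assumes both
words are Unicode Arabic strings (BAMA itself works with Buckwalter
transliteration), that diacritics are the Unicode combining marks, and
that a candidate letter with no counterpart in the original counts as a
mismatch. The names split_letters, score and best_candidate are
hypothetical.

```python
import unicodedata


def split_letters(word):
    """Group a string into [letter, set_of_diacritics] units.

    Assumption: a diacritic is any character with a non-zero Unicode
    combining class (true for the Arabic harakat, shadda and sukun).
    """
    units = []
    for ch in word:
        if unicodedata.combining(ch) and units:
            units[-1][1].add(ch)       # attach mark to the preceding letter
        else:
            units.append((ch, set()))  # start a new letter
    return units


def score(candidate, original):
    """Score one candidate analysis against the original token."""
    cand = split_letters(candidate)
    orig = split_letters(original)
    total = 0
    for i, (letter, marks) in enumerate(cand):
        # Step 1: +10 if the letter matches at the same position, else -10.
        if i < len(orig) and orig[i][0] == letter:
            total += 10
            # Step 2: only for matching letters, +1 / -1 per diacritic.
            for mark in marks:
                total += 1 if mark in orig[i][1] else -1
        else:
            total -= 10
    return total


def best_candidate(candidates, original):
    """Pick the highest-scoring candidate (first one wins on a tie)."""
    return max(candidates, key=lambda c: score(c, original))


# Example: original token "kataba", written with Unicode escapes.
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"
kutiba = "\u0643\u064F\u062A\u0650\u0628\u064E"
print(score(kataba, kataba))  # 33: three letters (+30), three diacritics (+3)
print(score(kutiba, kataba))  # 29: three letters (+30), diacritics -1 -1 +1
```

Because the loop runs over the candidate's letters only, a candidate that
is shorter than the original is not penalised for the letters it lacks;
whether that is desirable is left open here.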

It would be great if you could have a quick look at the data. I am now
thinking about what to do next. My aim is to push up the accuracy of the
POS/Morph tagger as far as possible. Some ideas come to mind:

1) I could select N tokens (N = 100, N = 1000?) and manually go
through them to give the current analyzer a % accuracy score (or
F-measure, i.e. the harmonic mean of precision and recall?).
2) Another idea is to make a list of missing words and their frequencies.
3) Some other work I could do: make a list of the POS tags coming
out of BAMA. I am not sure what all of these mean, or how many there
are, although I skimmed through Tim Buckwalter's documentation and
that looked quite comprehensive. Perhaps I should map the POS tag set
to something more standard or well known?
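
Ideas 1 and 2 could be prototyped along these lines. Again this is only a
Python sketch with hypothetical names (sample_tokens, accuracy, f_measure,
missing_word_frequencies). It assumes the tokens are available as a list
of strings, that the manual check yields one True/False judgment per
sampled token, and that has_analysis is some caller-supplied test for
whether the analyzer returned any solution for a word.

```python
import random
from collections import Counter


def sample_tokens(tokens, n, seed=0):
    """Return n distinct token positions, chosen reproducibly, for manual review."""
    rng = random.Random(seed)
    return rng.sample(range(len(tokens)), min(n, len(tokens)))


def accuracy(judgments):
    """Fraction of manually checked tokens whose top-ranked analysis was correct."""
    return sum(judgments) / len(judgments)


def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def missing_word_frequencies(tokens, has_analysis):
    """List (word, count) pairs for words with no analysis, most frequent first."""
    return Counter(t for t in tokens if not has_analysis(t)).most_common()
```

Fixing the random seed means the same N tokens can be re-checked after
each change to the scoring function, so the accuracy figures stay
comparable between runs.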

Looking forward to your replies.

Kind Regards,

-- Kais Dukes

--------------------------------------------------------------------------
End of Arabic-L:  31 Mar 2009