[Corpora-List] Arabic transliteration

Tim Buckwalter timbuck2 at ldc.upenn.edu
Mon Oct 22 17:37:35 UTC 2007


Dear Corpora friends,
There has been some resistance to using the Buckwalter transliteration 
in NLP because some of the characters interfere with XML and regular 
expressions, or are just too cryptic: ` * < > | & }. Dil Parkinson at 
BYU avoids these problems in his own transliteration scheme, which 
replaces the above problematic characters with alphabetic ones (but 
remains somewhat cryptic, in my opinion). Some modifications of the 
Buckwalter transliteration scheme replace the problematic characters 
with digits. Additional modifications have been made for representing 
characters outside of the basic Arabic character set, such as Persian 
characters. The introduction of digraphs by the Archimedes Project 
research team at Harvard 
(http://archimedes.fas.harvard.edu/docs/Arabic/) is an interesting 
modification of the Buckwalter transliteration, because some systematic 
use of digraphs might be needed for transliterating languages that use 
Arabic characters outside the basic range. Some arbitrary mapping of 
digits or letters to Arabic characters is inevitable, but the goal is 
simply to represent unambiguously how the language is written, allowing 
for one-to-one mapping to Unicode Arabic and back.
When I developed my 
Arabic transliteration system (with Ken Beesley at Alpnet in Provo, 
Utah, 1989), we needed to represent native Arabic orthography with a 
Latin-based scheme that was easy to input on ordinary keyboards, that 
used upper- and lower-case characters, but no accented (upper ASCII) 
characters. In other words, we needed a 7-bit representation of the 
Arabic writing system that would be suitable for NLP, especially in 
contexts where native Arabic characters could not be easily input or 
displayed (especially with bi-directional issues), or where non-Arabists 
needed to read and make some sense of Arabic text data.
I feel that in 
many NLP publications where the focus is not necessarily a discussion of 
Arabic orthography, IPA or modified LC transliteration would be more 
suitable, but note that most of these schemes require use of upper-ASCII 
(8-bit), Latin Extended, or even Greek characters. Although this can 
work well in printed publications (such as the recently published book 
that Nizar mentioned), this kind of data does not travel by e-mail or 
across platforms as well or as safely as 7-bit data. In any case, 7-bit 
data is easy to input on all platforms.
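
To make the one-to-one mapping to Unicode Arabic and back concrete, here is a 
minimal Python sketch. The table follows the Buckwalter scheme as it is 
commonly published, so please check it against an authoritative copy before 
relying on it. The names BW_TO_AR, AR_TO_BW, bw_to_arabic and arabic_to_bw 
are my own, chosen only for this illustration.

```python
# Buckwalter symbols for two contiguous Unicode ranges (assumed table,
# as commonly published): U+0621..U+063A and U+0640..U+0652.
LETTERS = "'|>&<}AbptvjHxd*rzs$SDTZEg"   # hamza .. ghain
MARKS = "_fqklmnhwYyFNKaui~o"            # tatweel .. sukun

BW_TO_AR = {}
for offset, symbol in enumerate(LETTERS):
    BW_TO_AR[symbol] = chr(0x0621 + offset)
for offset, symbol in enumerate(MARKS):
    BW_TO_AR[symbol] = chr(0x0640 + offset)
BW_TO_AR["`"] = "\u0670"   # dagger alif
BW_TO_AR["{"] = "\u0671"   # alif wasla

# The reverse table exists only because the mapping is one-to-one.
AR_TO_BW = {arabic: symbol for symbol, arabic in BW_TO_AR.items()}
assert len(AR_TO_BW) == len(BW_TO_AR)


def bw_to_arabic(text):
    """Replace each Buckwalter symbol with its Arabic character.
    Characters outside the table (spaces, digits, ...) pass through."""
    return "".join(BW_TO_AR.get(ch, ch) for ch in text)


def arabic_to_bw(text):
    """Replace each Arabic character with its Buckwalter symbol.
    Characters outside the table pass through unchanged."""
    return "".join(AR_TO_BW.get(ch, ch) for ch in text)


word = "ktAb"                      # kaf, teh, alif, beh
arabic = bw_to_arabic(word)
assert arabic_to_bw(arabic) == word
```

Because unmapped characters pass through unchanged, the round trip is only 
guaranteed for text made up of mapped characters plus characters that appear 
in neither table (spaces and digits, for example). Arabic text that contains 
embedded Latin letters would need separate handling.
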
Finally, there is a very nice online utility created by Ota Smrz for 
converting among several Arabic transliteration schemes:
   http://ufal.mff.cuni.cz/cgi-bin/smrz/Encode/Arabic/index.fcgi
-- Tim Buckwalter


