[Corpora-List] Arabic transliteration
Tim Buckwalter
timbuck2 at ldc.upenn.edu
Mon Oct 22 17:37:35 UTC 2007
Dear Corpora friends,

There has been some resistance to using the Buckwalter transliteration
in NLP because some of the characters interfere with XML and regular
expressions, or are just too cryptic: ` * < > | & }. Dil Parkinson at
BYU avoids these problems in his own transliteration scheme, which
replaces the above problematic characters with alphabetic ones (but
remains somewhat cryptic, in my opinion). Some modifications of the
Buckwalter transliteration scheme replace the problematic characters
with digits. Additional modifications have been made for representing
characters outside of the basic Arabic character set, such as Persian
characters. The introduction of digraphs by the Archimedes Project
research team at Harvard
(http://archimedes.fas.harvard.edu/docs/Arabic/) is an interesting
modification of the Buckwalter transliteration, because some systematic
use of digraphs might be needed for transliterating languages that use
Arabic characters outside the basic range. Some arbitrary mapping of
digits or letters to Arabic characters is inevitable, but the goal is
simply to represent unambiguously how the language is written, allowing
for one-to-one mapping to Unicode Arabic and back.

When I developed my
Arabic transliteration system (with Ken Beesley at Alpnet in Provo,
Utah, 1989), we needed to represent native Arabic orthography with a
Latin-based scheme that was easy to input on ordinary keyboards, that
used upper- and lower-case characters, but no accented (upper ASCII)
characters. In other words, we needed a 7-bit representation of the
Arabic writing system that would be suitable for NLP, especially in
contexts where native Arabic characters could not be easily input or
displayed (particularly given bidirectional rendering issues), or where
non-Arabists needed to read and make some sense of Arabic text data.

I feel that in many NLP publications where the focus is not necessarily
a discussion of Arabic orthography, IPA or a modified LC (Library of
Congress) transliteration would be more suitable, but note that most of
these schemes require use of upper-ASCII
(8-bit), Latin Extended, or even Greek characters. Although this can
work well in printed publications (such as the recently published book
that Nizar mentioned), this kind of data does not travel by e-mail or
across platforms as well, or as safely, as 7-bit data. In any case, 7-bit
data is easy to input on all platforms.
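
The one-to-one mapping to Unicode Arabic and back, mentioned above, can
be sketched in a few lines of Python. This is only an illustration, not
code from the original scheme's authors: the table follows the commonly
published Buckwalter chart and should be checked against an
authoritative copy, and the names BW_TO_AR, AR_TO_BW, to_arabic and
to_buckwalter are hypothetical.

```python
from xml.sax.saxutils import escape

# Buckwalter symbols laid out in Unicode code-point order (assumed to match
# the commonly published chart; verify before relying on it).
BW_TO_AR = {}
for start, symbols in (
    (0x0621, "'|>&<}AbptvjHxd*rzs$SDTZEg"),  # hamza .. ghain
    (0x0640, "_fqklmnhwYyFNKaui~o"),          # tatweel .. sukun
    (0x0670, "`{"),                           # dagger alef, alef wasla
):
    for offset, symbol in enumerate(symbols):
        BW_TO_AR[symbol] = chr(start + offset)

# The reverse table exists only because the mapping is one-to-one.
AR_TO_BW = {arabic: latin for latin, arabic in BW_TO_AR.items()}
assert len(AR_TO_BW) == len(BW_TO_AR)


def to_arabic(text):
    """Map each Buckwalter symbol to its Arabic character; pass others through."""
    return "".join(BW_TO_AR.get(ch, ch) for ch in text)


def to_buckwalter(text):
    """Map each Arabic character to its Buckwalter symbol; pass others through."""
    return "".join(AR_TO_BW.get(ch, ch) for ch in text)


word = to_arabic("kitAb")        # "\u0643\u0650\u062a\u0627\u0628"
round_trip = to_buckwalter(word)  # "kitAb"

# Symbols that must be escaped when the transliteration is embedded in XML.
needs_xml_escape = sorted(s for s in BW_TO_AR if escape(s) != s)  # ['&', '<', '>']
```

Characters outside the table pass through unchanged, so the round trip
is exact only for text made up of mapped symbols plus characters that
appear in neither table. The last line picks out the symbols that clash
with XML markup, which is one source of the resistance described at the
start of this message.
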
Finally, there is a very nice online utility created by Ota Smrz for
converting among several Arabic transliteration schemes:
http://ufal.mff.cuni.cz/cgi-bin/smrz/Encode/Arabic/index.fcgi
-- Tim Buckwalter
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora