Arabic-L:LING:Arabic Transliteration Discussion from Corpora
Dilworth Parkinson
dilworth_parkinson at BYU.EDU
Wed Oct 24 18:52:52 UTC 2007
------------------------------------------------------------------------
Arabic-L: Wed 24 Oct 2007
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject:Arabic Transliteration Discussion from Corpora
-------------------------Messages-----------------------------------
1)
Date: 24 Oct 2007
From:from CORPORA
Subject:Arabic Transliteration Discussion from Corpora
[The following discussion of Arabic transliteration took place on
Corpora. It included information and sites I had not been acquainted
with, so I thought some of you might want to check it out as well. I've
left out most of the names (you could look them up in the Corpora
archives), except for Tim's, who is a regular contributor to Arabic-L.]
############
I am looking for a standard for Arabic transliteration
using Latin characters
Can anybody give me a hint?
Thanks a lot,
############
We use the Buckwalter transliteration system, which is a one-to-one
encoding of Arabic letters into Latin script. One issue with this
transliteration system is that it uses some special characters such
as "><{}*", but you can always use alternates for these, especially if
you use XML tagging.
Here is a pointer
http://www.qamus.org/transliteration.htm
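As a rough illustration of what a one-to-one encoding means here, the
sketch below maps Arabic letters to their Buckwalter symbols. It covers
only the hamza forms and the letters (no short vowels or other
diacritics), and the names BUCKWALTER and to_buckwalter are made up for
this example; the page above remains the reference for the full scheme.

```python
# Partial Buckwalter table (letters only; diacritics omitted for brevity).
# Keys are written as \u escapes so this source stays plain 7-bit ASCII.
BUCKWALTER = {
    "\u0621": "'",  # hamza
    "\u0622": "|",  # alef with madda
    "\u0623": ">",  # alef with hamza above
    "\u0624": "&",  # waw with hamza
    "\u0625": "<",  # alef with hamza below
    "\u0626": "}",  # yeh with hamza
    "\u0627": "A",  # alef
    "\u0628": "b",  # beh
    "\u0629": "p",  # teh marbuta
    "\u062a": "t",  # teh
    "\u062b": "v",  # theh
    "\u062c": "j",  # jeem
    "\u062d": "H",  # hah
    "\u062e": "x",  # khah
    "\u062f": "d",  # dal
    "\u0630": "*",  # thal
    "\u0631": "r",  # reh
    "\u0632": "z",  # zain
    "\u0633": "s",  # seen
    "\u0634": "$",  # sheen
    "\u0635": "S",  # sad
    "\u0636": "D",  # dad
    "\u0637": "T",  # tah
    "\u0638": "Z",  # zah
    "\u0639": "E",  # ain
    "\u063a": "g",  # ghain
    "\u0641": "f",  # feh
    "\u0642": "q",  # qaf
    "\u0643": "k",  # kaf
    "\u0644": "l",  # lam
    "\u0645": "m",  # meem
    "\u0646": "n",  # noon
    "\u0647": "h",  # heh
    "\u0648": "w",  # waw
    "\u0649": "Y",  # alef maksura
    "\u064a": "y",  # yeh
}


def to_buckwalter(text):
    """Replace each Arabic letter by its Buckwalter symbol.

    Characters outside the table (spaces, digits, punctuation,
    diacritics) are passed through unchanged.
    """
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)


# kaf + teh + alef + beh
print(to_buckwalter("\u0643\u062a\u0627\u0628"))  # ktAb
# heh + thal + alef: shows one of the non-alphabetic symbols
print(to_buckwalter("\u0647\u0630\u0627"))        # h*A
```

Because every letter maps to exactly one symbol, the output has the
same length as the input, which is what distinguishes this kind of
transliteration from a phonetic transcription.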
############
I would actually advise that it is not a good idea to use a
latin-character transliteration for Arabic text unless you really have
to. In Arabic computational linguistics, the Buckwalter transliteration
scheme is probably the most widely known, see
http://www.qamus.org/transliteration.htm
BUT few Arabic scholars outside of Computational Linguistics would
be at all familiar with this transliteration.
This was developed at a time when computers could not handle Arabic
script, only ASCII Latin characters. More recently, most computers
use the Unicode character set, which includes an encoding of the Arabic
alphabet, so it is now relatively straightforward to display, process,
and concordance Arabic text in original Arabic script, e.g. see
Roberts, Andrew; Al-Sulaiti, Latifa; Atwell, Eric. aConCorde: Towards an
open-source, extendable concordancer for Arabic. Corpora, vol. 1, pp.
39-57. 2006.
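To give an idea of how little is needed once the text is in Unicode,
here is a minimal keyword-in-context sketch that works directly on
Arabic script. It is not how aConCorde works; the function name kwic,
the whitespace tokenisation, and the sample sentence are all
assumptions made for illustration only.

```python
def kwic(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each exact match."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((" ".join(left), token, " ".join(right)))
    return hits


# Sample sentence in Arabic script, split naively on whitespace.
text = "ذهب الولد إلى المدرسة ثم عاد الولد إلى البيت"
tokens = text.split()

# Look up every occurrence of the second word of the sentence.
for left, word, right in kwic(tokens, tokens[1], width=1):
    print(left, "|", word, "|", right)
```

A real concordancer would also need to deal with clitics, diacritics
and right-to-left display, none of which this sketch attempts.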
Why do you want to transliterate your Arabic text to the Latin alphabet?
The great thing about standards is that there are so many to choose
from¹ :)
http://en.wikipedia.org/wiki/Arabic_transliteration#Transliteration_standards
############
While I have no idea as to why Lamia wants to do this, I know why I
do - I want to make it convenient for a specific group of
re-translators to work with texts originally in Arabic (or Farsi) but
already translated into English, when they do not know Arabic (or
Farsi) but could use Arabic (or Farsi) to English dictionaries and/or
corpora to enhance their efforts, if only they were able to read the
script used to represent the former.
############
Many thanks for your message.
> Why do you want to transliterate your Arabic text to the Latin
> alphabet?
>
Generally, when we submit a research paper (written in English or
French) on Arabic NLP, we are asked to present a Latin-character
transliteration of the Arabic sentences!
############
For this, I suggest the Buckwalter transliteration is not the best
solution, as it maps some Arabic letters to ASCII characters which are
not Roman alphabet letters, making the transcription simple to process
but hard for humans to read. I suggest the reviewers want a
non-Arabic-speaker to have some idea of what the text sounds like,
so an approximate phonetic transcription would probably suffice -
e.g. see the examples (pages 6, 8, ...) in
Al-Sulaiti, Latifa; Atwell, Eric. The design of a corpus of contemporary
Arabic. International Journal of Corpus Linguistics, vol. 11, pp.
135-171. 2006.
############
I wholeheartedly agree with Eric: Buckwalter transliteration is
usually not appropriate for a published paper because only a
handful of people working in Arabic NLP can read it fluently. The
appropriate transliteration or transcription depends on the
expected audience -- Arabists/Orientalists would probably be more
comfortable reading a ZDMG-style transcription, whereas general
linguists would probably prefer one based on IPA. The following
page has a table comparing the two:
http://en.wikipedia.org/wiki/DIN-31635
############
I suggest the Arabic Unified Transliteration (AUT) at the address:
http://fr.wikipedia.org/wiki/Utilisateur:Qalandariyy/Translitt%C3%A9ration_arabe_unifi%C3%A9e
It can be easily read by Arabic as well as non-Arabic speakers ...
In addition, the author gives a comparison of about ten different
transliteration systems...
NB: The paper is written in French
############
Hi, there are many different types of transliterations for Arabic
that can serve different purposes. Often a phonetic "transcription"
is good enough for linguistic papers; however, discussions of Arabic
orthographic peculiarities are much harder without a one-to-one
"transliteration" that romanizes the Arabic alphabet. There is a
chapter in a new book published this year in which a proposal is
made for a transliteration for Arabic (that addresses most of the
issues people have with Buckwalter's transliteration). This
approach was used in all the papers in that book:
Habash, Nizar, Abdelhadi Soudi and Tim Buckwalter. On Arabic
Transliteration. In Arabic Computational Morphology:
Knowledge-based and Empirical Methods. Soudi, Abdelhadi; van den
Bosch, Antal; Neumann, Günter (Eds.), 2007. ISBN: 978-1-4020-6045-8
Online:
http://www.nizarhabash.com/publications/chapter2BisHabash_et_al-2007-web.doc
############
I am unable to read the Arabic characters in this particular document
- can anyone make a PDF of it?
############
Sorry, here is the link to a PDF version:
http://www.nizarhabash.com/publications/chapter2BisHabash_et_al-2007-web.pdf
############
Dear Corpora friends,
There has been some resistance to using the Buckwalter transliteration
in NLP because some of the characters interfere with XML and regular
expressions, or are just too cryptic: ` * < > | & }. Dil Parkinson at
BYU avoids these problems in his own transliteration scheme, which
replaces the above problematic characters with alphabetic ones (but
remains somewhat cryptic, in my opinion). Some modifications of the
Buckwalter transliteration scheme replace the problematic characters
with digits. Additional modifications have been made for representing
characters outside of the basic Arabic character set, such as Persian
characters. The introduction of digraphs by the Archimedes Project
research team at Harvard
(http://archimedes.fas.harvard.edu/docs/Arabic/) is an interesting
modification of the Buckwalter transliteration, because some systematic
use of digraphs might be needed for transliterating languages that use
Arabic characters outside the basic range. Some arbitrary mapping of
digits or letters to Arabic characters is inevitable, but the goal is
simply to represent unambiguously how the language is written, allowing
for one-to-one mapping to Unicode Arabic and back. When I developed my
Arabic transliteration system (with Ken Beesley at Alpnet in Provo,
Utah, 1989), we needed to represent native Arabic orthography with a
Latin-based scheme that was easy to input on ordinary keyboards, that
used upper- and lower-case characters, but no accented (upper ASCII)
characters. In other words, we needed a 7-bit representation of the
Arabic writing system that would be suitable for NLP, especially in
contexts where native Arabic characters could not be easily input or
displayed (especially with bi-directional issues), or where non-Arabists
needed to read and make some sense of Arabic text data. I feel that in
many NLP publications where the focus is not necessarily a discussion of
Arabic orthography, IPA or modified LC transliteration would be more
suitable, but note that most of these schemes require use of upper-ASCII
(8-bit), Latin Extended, or even Greek characters. Although this can
work well in printed publications (such as the recently published book
that Nizar mentioned), this kind of data does not travel well by e-mail
or across platforms, nor as safely as 7-bit data. In any case, 7-bit
data is easy to input on all platforms.
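To make the points about one-to-one mapping and 7-bit data concrete,
here is a minimal sketch of a round trip between Unicode Arabic and
Buckwalter symbols, plus a variant that swaps the problematic
characters for letters. The table covers U+0621 to U+063A, U+0640 to
U+0652, U+0670 and U+0671. The replacement letters in SAFE are an
arbitrary choice made for this example only; they are not Dil
Parkinson's scheme nor any published variant, and all function names
are hypothetical.

```python
def _build_table():
    """Arabic code point -> Buckwalter symbol, built from contiguous runs."""
    table = {}
    runs = (
        (0x0621, "'|>&<}AbptvjHxd*rzs$SDTZEg"),  # hamza .. ghain
        (0x0640, "_fqklmnhwYyFNKaui~o"),         # tatweel .. sukun
        (0x0670, "`{"),                          # dagger alef, alef wasla
    )
    for start, symbols in runs:
        for offset, symbol in enumerate(symbols):
            table[chr(start + offset)] = symbol
    return table


AR2BW = _build_table()
BW2AR = {symbol: char for char, symbol in AR2BW.items()}
assert len(BW2AR) == len(AR2BW)  # no symbol is used twice: one-to-one

# Illustrative replacements for the characters that clash with XML and
# regular expressions. These letters are an assumption of this sketch.
SAFE = {">": "O", "<": "I", "&": "W", "|": "M", "}": "Q", "*": "V", "`": "e"}
UNSAFE = {letter: symbol for symbol, letter in SAFE.items()}
assert not set(SAFE.values()) & set(AR2BW.values())  # no collisions


def to_bw(text):
    """Unicode Arabic -> Buckwalter; unknown characters pass through."""
    return "".join(AR2BW.get(ch, ch) for ch in text)


def to_ar(text):
    """Buckwalter -> Unicode Arabic. Only meaningful for text that is
    entirely transliteration, since Latin letters are all converted."""
    return "".join(BW2AR.get(ch, ch) for ch in text)


def to_safe(text):
    """Unicode Arabic -> Buckwalter with the SAFE replacements applied."""
    return "".join(SAFE.get(ch, ch) for ch in to_bw(text))


def from_safe(text):
    """Inverse of to_safe."""
    return to_ar("".join(UNSAFE.get(ch, ch) for ch in text))


# alef with hamza below, kasra, lam, fatha, alef maksura
word = "\u0625\u0650\u0644\u064e\u0649"
print(to_bw(word))            # <ilaY
print(to_safe(word))          # IilaY
print(to_bw(word).isascii())  # True
print(to_ar(to_bw(word)) == word and from_safe(to_safe(word)) == word)  # True
```

The two assertions on the tables are what guarantee the mapping "to
Unicode Arabic and back": each symbol stands for exactly one character,
and the replacement letters do not collide with symbols already in use.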
Finally, there is a very nice online utility created by Ota Smrz for
converting among several Arabic transliteration schemes:
http://ufal.mff.cuni.cz/cgi-bin/smrz/Encode/Arabic/index.fcgi
-- Tim Buckwalter
------------------------------------------------------------------------
End of Arabic-L: 24 Oct 2007