21.5228, Sum: Pashto in Unicode

linguist at LINGUISTLIST.ORG
Thu Dec 23 01:23:36 UTC 2010


LINGUIST List: Vol-21-5228. Wed Dec 22 2010. ISSN: 1068 - 4875.

Subject: 21.5228, Sum: Pashto in Unicode

Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>
 
Reviews: Monica Macaulay, U of Wisconsin-Madison  
Eric Raimy, U of Wisconsin-Madison  
Joseph Salmons, U of Wisconsin-Madison  
Anja Wanner, U of Wisconsin-Madison  
       <reviews at linguistlist.org> 

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, 
and donations from subscribers and publishers.

Editor for this issue: Danielle St. Jean <danielle at linguistlist.org>
================================================================  

===========================Directory==============================  

1)
Date: 20-Dec-2010
From: Ron Artstein [linguist at artstein.org]
Subject: Pashto in Unicode
 

	
-------------------------Message 1 ---------------------------------- 
Date: Wed, 22 Dec 2010 20:20:54
From: Ron Artstein [linguist at artstein.org]
Subject: Pashto in Unicode



Query for this summary posted in LINGUIST Issue: 21.2971                                                                                                                                               
 

Hi,

Several months ago I posted a query, asking whether there are 
standards for encoding the various Pashto y-characters in Unicode. I 
received many helpful responses on this list and the Unicode list 
(special thanks to Wilma Heston, Kamal Mansour and Roozbeh 
Pournader), and also met personally with several Pashto speakers in 
Southern California. The short answer is that there is a proposed 
standard but it is often not followed in actual electronic texts, partly due 
to inherent problems with the standard itself. So processing needs to 
be done with care. 

The long answer will give the details of the proposed standard at the 
end, but to make sense of it we will need to look at the history of these 
characters, both in the development of the Pashto script as an adaptation of 
the Arabic and Persian scripts, and in the later encoding of these 
scripts into computer character sets.

Terminological note: I will refer to characters and character bases by 
their Unicode names, allowing me to sidestep transliteration issues and 
the fact that the characters are known by different names in Arabic, 
Persian and Pashto.

1. Arabic

Each Arabic character consists of a base form and (possibly) a set of 
dots or other marks, whose use is compulsory in contemporary writing. 
Historically the dots developed as a way to disambiguate base forms 
that had become too similar, and they are distinct from a separate set 
of optional vowel diacritics. The Arabic script is cursive, and most 
characters are connected to the preceding and following characters 
within the word (though some character bases do not connect); 
consequently, each character has up to 4 shapes -- initial (connected 
only to the following character), medial (connected on both sides), final 
(connected only to the preceding character), and isolated. Often the 
various shapes are similar, but for some bases they look very different. 
For the yeh base, the initial and medial forms look very similar, but they 
are distinct from the isolated and final forms which are also similar to 
each other. When I talk about medial and final yeh forms I intend to 
cover also the initial and isolated forms, respectively.

While there has been (and continues to be) debate about what exactly 
constitutes a character in various derivatives of the Arabic script, the 
identity of characters used for writing the Arabic language follows a 
grammatical tradition of over a thousand years. This tradition 
recognizes 3 characters with a yeh base:

yeh: a yeh base with two dots below, used to represent the /j/ and /i:/ 
sounds. The standard arrangement of the two dots is horizontal, but 
they can be placed vertically or diagonally with no change in meaning. 
In Egypt, the final form is written without dots.

alef maksura: a yeh base with no dots, used historically to represent 
long /a:/ in certain contexts; in contemporary writing it is used only at 
the end of a word for certain short /a/ which derive from an 
etymological /j/.

yeh with hamza above: a yeh base with a hamza character above (or, 
in some historical texts, below), representing a glottal stop in certain 
contexts (typically before or after the vowel /i/).

Encoding of Arabic for text processing predates computers, going back 
to 5-bit teletype codes. However, these codes, as well as early 
computer codes, were all proprietary and did not allow interoperability 
across systems. Some of these systems had separate codes for initial, 
medial and final forms where the shapes differed significantly. The first 
documented and accepted standard for encoding Arabic characters 
was ASMO-449 (1982), a 7-bit code based on ASCII, with Arabic 
characters occupying the space of Latin lower-case letters. This code 
established the principle that each (traditional) character had exactly 
one code point, with selection of the appropriate contextual glyph done 
by software and not represented in the characters themselves. ASMO-
449 has 3 code points for yeh-based characters: yeh at 0x6A, alef 
maksura at 0x69, and yeh with hamza above at 0x46. Later 8-bit codes 
from the 1980s such as ECMA-114, ASMO-708, and ISO-8859-6 used 
the same 3 code points, transposed to 0xEA, 0xE9 and 0xC6 by 
turning on the eighth bit. The same code points found their way to 
Unicode starting at version 1.0 (1991) as U+064A Arabic Letter Yeh, 
U+0649 Arabic Letter Alef Maksura, and U+0626 Arabic Letter Yeh 
with Hamza Above.
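
The chain of code points described above can be checked mechanically. 
The following is a minimal sketch, assuming that Python's built-in 
"iso8859_6" codec reflects the ISO-8859-6 table; the names ASMO_449 and 
to_unicode are my own, introduced only for illustration.

```python
import unicodedata

# ASMO-449 (7-bit) code points for the three yeh-based characters,
# as listed in the text above.
ASMO_449 = {
    "yeh": 0x6A,
    "alef maksura": 0x69,
    "yeh with hamza above": 0x46,
}

def to_unicode(asmo_code: int) -> str:
    """Turn on the eighth bit, then decode the byte as ISO-8859-6."""
    eight_bit = asmo_code | 0x80
    return bytes([eight_bit]).decode("iso8859_6")

for label, code in ASMO_449.items():
    ch = to_unicode(code)
    print(f"{label}: 0x{code:02X} -> 0x{code | 0x80:02X} "
          f"-> U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Each 7-bit code lands on the 8-bit position and the Unicode character 
named in the paragraph above.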

The state of yeh-based characters in Arabic is rather straightforward, 
except in Egypt. As mentioned above, convention in Egypt is to write 
yeh in final position as a base form without dots, which makes it look 
identical to alef maksura. Moreover, since contemporary texts only use 
alef maksura at the end of a word, writing in Egypt has no need to 
distinguish between yeh and alef maksura, so confusion arises. The 
Egyptian daily Al-Ahram, for example, uses the character U+064A Yeh 
for both yeh and alef maksura in its online edition 
(http://www.ahram.org.eg/) and the result is that final yeh looks non-
Egyptian because of the dots, and alef maksura looks simply incorrect. 
In the print edition the characters appear correctly, with no dots on the 
final forms, presumably through the use of proprietary fonts.

2. Persian

The Persian script is an adaptation of the Arabic script. Native Persian 
vocabulary uses just one yeh-based character, representing the /j/ and 
/i:/ sounds; additionally, alef maksura and yeh with hamza above are 
used in loanwords from Arabic. The convention for writing the yeh 
character in Persian is the same as for Arabic in Egypt: two dots in 
medial form, none in final form. Thus, Persian does not distinguish 
between yeh and alef maksura.

The first standard 8-bit code for Persian with contextual rendering of 
characters was ISIRI-3342 (1993), which replaced a previous standard 
with separate codes for distinct contextual shapes. ISIRI-3342 has a 
yeh character in position 0xE1, which is displayed without dots on the 
final form. ISIRI-3342 also includes yeh with hamza above in position 
0xFB as well as a character called "Arabic yeh" with dots on the final 
form in position 0xFE; an annotation specifies that the latter two 
characters are taken from ISO-8859-6. There is no specific character 
for alef maksura.

The yeh character from ISIRI-3342 corresponds to Unicode character 
U+06CC Arabic Letter Farsi Yeh, which appears already in Unicode 
version 1.0 (1991), two years prior to the publication of ISIRI-3342. 
Character U+06CC carries an explicit annotation: "Initial and medial 
forms of this letter have dots". I have not found documentation on why 
Unicode and ISIRI decided to give separate code points to the Arabic 
and Persian conventions of writing yeh. It is not clear if an actual need 
exists to use both conventions in a single document, because when 
Persian names or terms are written in an Arabic document or vice 
versa, it is common practice to write the yeh according to the 
conventions of the document language rather than the source 
language. At any rate, presently Unicode contains the following three 
characters which encode versions of yeh with and without dots:

U+0649 no dots medially or finally
U+064A two dots medially and finally
U+06CC two dots medially, none finally
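
The table above can be written as a small lookup, which makes it easy to 
ask in which positions two of these characters look alike. This is only 
a sketch of the table as I have described it, not of any font's actual 
behavior; YehDots, YEH_DOTS and same_shape are hypothetical names.

```python
from typing import NamedTuple

class YehDots(NamedTuple):
    medial: bool  # two dots on initial/medial forms?
    final: bool   # two dots on final/isolated forms?

# Dot pattern per code point, copied from the table above.
YEH_DOTS = {
    "\u0649": YehDots(medial=False, final=False),
    "\u064A": YehDots(medial=True, final=True),
    "\u06CC": YehDots(medial=True, final=False),
}

def same_shape(a: str, b: str, position: str) -> bool:
    """True if characters a and b have the same dot pattern in the
    given position ('medial' or 'final')."""
    return getattr(YEH_DOTS[a], position) == getattr(YEH_DOTS[b], position)

# U+064A and U+06CC coincide medially but differ finally.
print(same_shape("\u064A", "\u06CC", "medial"))
print(same_shape("\u064A", "\u06CC", "final"))
```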

The intention behind these codes is probably to use U+0649 and 
U+064A for alef maksura and yeh in Arabic, and U+06CC for yeh in 
Persian; it is not clear what the intention is for Arabic in Egypt, or for 
alef maksura in Persian words of Arabic origin. Things are more 
complicated in practice. The online edition of Hamshahri newspaper 
(http://www.hamshahrionline.ir/) uses U+06CC regularly, though stray 
occurrences of U+064A are also found; in contrast, the online edition of 
Kayhan (http://kayhannews.ir/) uses U+064A exclusively, resulting in 
inappropriate dots on all final yeh forms (as with Al-Ahram in Egypt, 
these dots are absent from the print edition, again probably due to 
proprietary fonts). Online forums in Persian such as Balatarin 
(http://balatarin.com/) show a mixture of U+06CC and U+064A.
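
A mixture of this kind is easy to measure once the text is available as 
a Unicode string. The sketch below simply counts the three dotted and 
dotless yeh code points; the function name and the sample string are my 
own illustrations, not data from any of the sites mentioned.

```python
from collections import Counter

# The three code points discussed above, keyed to printable labels.
YEH_VARIANTS = {
    "\u0649": "U+0649",
    "\u064A": "U+064A",
    "\u06CC": "U+06CC",
}

def count_yeh_variants(text: str) -> Counter:
    """Count how often each yeh variant occurs in text."""
    return Counter(YEH_VARIANTS[ch] for ch in text if ch in YEH_VARIANTS)

# Illustrative sample: the first word uses U+06CC twice, the second
# word ends in U+064A.
sample = "\u0627\u06CC\u0631\u0627\u0646\u06CC \u0641\u0627\u0631\u0633\u064A"
print(count_yeh_variants(sample))
```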

3. Pashto

The Pashto script is an adaptation of the Persian-Arabic script; it 
shares some non-Arabic characters with Persian but differs on others 
(for example the sound /g/, not represented in the Arabic script, is 
written by different modifications of the kaf character base in Persian 
and Pashto). Traditionally, Pashto used a single yeh character with the 
same convention as in Persian, of two dots in the medial form and none 
on the final form, with no significance attached to the visual 
arrangement of the dots. This character was three-way ambiguous 
between the sounds /j/, /i:/ and /e/. Some informants I met with who had 
left Afghanistan and Pakistan in the 1980s are not familiar with any 
distinction among yeh characters, and while they tend to write final yeh 
without dots, they also accept it with dots. However, recent 
developments have caused some differentiation (Wilma Heston 
suggests that this came from some conferences organized in the early 
1990s by the Pashto Academy at the University of Peshawar, Pakistan; 
I was not able to find documentation on this effort, though reference to 
"a 1991 meeting of Pashto experts in Peshawar" is made in the UNDP 
document cited below).

One convention that has gained fairly wide acceptance is a distinction 
between a horizontal arrangement of the dots, representing /j/ or /i:/ as 
in Arabic and Persian, and a vertical arrangement representing the 
sound /e/. This distinction is the same as in Uighur, and the character 
with vertical dots is codified as U+06D0 Arabic Letter E. Additional 
conventions concern the sound /j/ following schwa in final position, 
represented as U+0626 yeh with hamza above when it is used as the 
2nd person plural verb inflection, and as U+06CD Arabic Letter Yeh 
with Tail when it is used to represent the feminine noun and adjective 
inflection. This four-way distinction is used, for example, in the following 
book: Habibullah Tegey and Barbara Robson, A Reference Grammar 
of Pashto, Center for Applied Linguistics, Washington DC, 1996 
(http://www.eric.ed.gov/ERICWebPortal/detail?accno=ED399825) 
(unfortunately the PDF is a scan of a printout, so I can only identify the 
characters by their visual shape, but I list them with the most likely 
corresponding Unicode characters; the final form of the j/i character is 
usually without dots, but sometimes with).

U+06CC or U+064A /j/ and /i:/
U+06D0 /e/
U+06CD /j/ after schwa in final position, feminine marker
U+0626 /j/ after schwa in final position, 2nd person plural marker

A five-way distinction is offered by M. A. Zyar, A Guide of Standard 
Pashto, Oxford, 2006 
(http://www.tolafghan.com/assets/download/pashto_liklar.pdf). The 
book itself is in Pashto, which I can't read, but on page 387 it spells out 
the same usage as Tegey and Robson above, with an additional 
distinction: a final form with dots is used for /i:/, while a final form 
without dots is used for the masculine nominal marker /aj/. Word-
medially, both /j/ and /i:/ are represented with two dots in a horizontal 
arrangement.

Zyar does not specify a computer encoding for these characters; the 
book itself contains a mix of U+0649, U+064A and U+06CC, suggesting 
that the producers of the book cared only about the visual shape of the 
glyphs, not the machine encoding. The same five-way distinction of yeh 
shapes with a recommended encoding (but no phonetic 
characterization) is specified in the document Computer Locale 
Requirements for Afghanistan, published in 2003 by the UNDP 
(http://www.evertype.com/standards/af/af-locales.pdf), seen below: 

U+06CC medial forms with dots (/i:/ and /j/) and dotless final form (/j/)
U+064A final form with dots (/i:/)

The rationale is given in a note on page 5 (<> indicates places where 
the document has a specific Pashto glyph): "Since the shapes of the <> 
initial and <> medial forms of the Pashto letters <> ye (U+06CC) and <> 
saxta ye (U+064A) are exactly the same, to avoid encoding ambiguities 
in Pashto data ... we recommend that the Unicode character for saxta 
ye, namely <> U+064A, never be used in initial and medial forms in 
Pashto data".
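
The recommendation quoted above lends itself to an automatic check. The 
following is a rough sketch, not part of the UNDP document: it flags 
every U+064A that is followed by another letter, on the simplifying 
assumption that a yeh is in initial or medial form exactly when a letter 
follows it (combining marks are skipped). Real joining behavior has 
exceptions that this ignores, and the function names are hypothetical.

```python
import unicodedata

ARABIC_YEH = "\u064A"

def _next_base_char(text: str, i: int) -> str:
    """Return the first character after index i that is not a
    nonspacing combining mark, or '' at the end of the text."""
    for ch in text[i + 1:]:
        if unicodedata.category(ch) != "Mn":
            return ch
    return ""

def nonfinal_arabic_yeh(text: str) -> list[int]:
    """Indices of U+064A occurrences that are followed by a letter,
    i.e. that are not word-final under the assumption stated above."""
    hits = []
    for i, ch in enumerate(text):
        if ch == ARABIC_YEH:
            nxt = _next_base_char(text, i)
            if nxt and unicodedata.category(nxt).startswith("L"):
                hits.append(i)
    return hits
```

Under the recommended encoding this function should return an empty 
list for well-formed Pashto data; any index it returns points at a 
U+064A that would be U+06CC under that convention.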

The same convention is followed in other on-line resources as well as a 
proprietary electronic lexicon, so it can be considered to be the 
preferred or standard encoding. However, this is not the only 
convention found. For example, the Wikipedia article 
(http://en.wikipedia.org/wiki/Pashto_alphabet) makes a visual 
distinction:

U+064A forms with dots (/i:/ and /j/ medially, /i:/ finally)
U+0649 forms without dots (only /j/ in word-final position)

Electronic documents in the wild such as BBC News 
(http://www.bbc.co.uk/pashto) and Deutsche Welle 
(http://www.dw-world.de/) show great confusion, with U+064A and 
U+06CC used interchangeably even within a single article, and even in 
final position where the glyphs differ.
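
For tasks such as search or dictionary lookup over text of this kind, 
one possible workaround is to fold the interchangeable code points to a 
single one before comparing strings. This is my own suggestion rather 
than an established practice, and it is lossy: it erases the final-form 
distinction between /i:/ and /j/ that the UNDP encoding maintains, so it 
is suitable only for building comparison keys, not for storing text. The 
choice of U+06CC as the target is arbitrary.

```python
# Fold U+064A and U+0649 to U+06CC. The visually distinct characters
# U+06D0, U+06CD and U+0626 are deliberately left untouched.
_YEH_FOLD = str.maketrans({"\u064A": "\u06CC", "\u0649": "\u06CC"})

def fold_yeh(text: str) -> str:
    """Return a comparison key in which the three dotted/dotless yeh
    code points are treated as one."""
    return text.translate(_YEH_FOLD)

# The same word typed with different yeh code points now compares equal.
print(fold_yeh("\u0633\u064A") == fold_yeh("\u0633\u06CC"))
```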

4. Concluding Remarks

I believe that the confusion in the use of Pashto yeh characters is 
inherent to the design of the script, namely the fact that /i:/ and /j/ are 
distinguished in the final form but not in the medial form. This already is 
a difficult concept, and I cannot think of another case where a single 
language written in a modern derivative of the Arabic script uses two 
distinct characters that have an identical appearance in some positions 
but look different in others. 

The existence of three Unicode characters, representing different 
combinations of dots on medial and final yeh, gives what appears at 
first glance to be a linguistically elegant alternative to the encoding of 
/i:/ and /j/ in Pashto:

U+064A: /i:/ (with dots medially and finally)
U+06CC: /j/ (with dots medially, without finally)

However, this encoding is impractical, since it is not reasonable to 
expect typists to make a distinction between characters that have 
identical shape. This point is illustrated by the complete merger in 
visual form between alef maksura and yeh in Persian and Arabic 
written in Egypt: while authors and typists presumably make a 
conceptual distinction between the characters (since they are 
pronounced differently), the fact that the characters look the same 
means that people do substitute one for the other while typing. One can 
hope that a future spelling reform in Pashto will either fully split /i:/ 
and /j/ in all contextual positions, or fully merge them. Until then, we 
will have to live with confusion in electronic documents.

- Ron 

Linguistic Field(s): Writing Systems



