[Corpora-List] Pashto (was: Which Statistical Test is Suitable)

Mike Maxwell maxwell at umiacs.umd.edu
Sun Jul 17 19:36:04 UTC 2011


On 7/16/2011 1:05 AM, fatima zuhra wrote:
>> You're referring to the two Unicode characters Arabic Kaaf (U+0643)
>> and Arabic Letter Keheh (U+06A9, also commonly called Kaf or Kaaf), right?
>
> I am referring to variations e.g. آئينـــــــــه and آئينه. It is a
> single word (meaning “mirror”), written in two styles. In the first
> occurrence, the second-last grapheme is made longer. In a similar way,
> “kaaf”, “baa”, “meem” and many more graphemes are sometimes written
> longer and sometimes shorter. For software, these are two different words.

In Unicode, these differences are encoded by the addition of the tatweel
(= kashida, U+0640); the remaining graphemes are unchanged (except
perhaps for where they visually join with this character).  It should be
sufficient for software to ignore this character, or to remove it from
the text before the software sees it.  It's the analog of removing a
hyphen+newline in processing Roman-script languages.
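
As a rough illustration, here is a minimal Python sketch of that
preprocessing step. The function name strip_tatweel is made up for this
example, and the code points used to spell the sample word are my
assumption about how it was typed (the yeh in particular could be a
different code point in your data); only U+0640 itself matters here.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida)


def strip_tatweel(text: str) -> str:
    """Return text with every tatweel character removed."""
    return text.replace(TATWEEL, "")


# The "mirror" example, spelled with assumed code points:
# alef with madda, yeh with hamza, yeh, noon, heh.
plain = "\u0622\u0626\u064A\u0646\u0647"
# Same word with nine tatweels inserted before the final letter.
stretched = "\u0622\u0626\u064A\u0646" + TATWEEL * 9 + "\u0647"

# Before normalization the two strings differ; afterwards they match.
print(stretched == plain)
print(strip_tatweel(stretched) == plain)
```

Running this over a corpus before tokenizing or counting should make the
stretched and unstretched spellings fall together as one word type.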

The tatweel was also used in the old ISO 8859-6 8-bit encoding system.
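
If your text is still in that legacy encoding, the same cleanup applies
once the bytes are decoded. The sketch below assumes Python's built-in
iso8859_6 codec and uses a made-up byte string; as far as I can tell
from the code chart, byte 0xE0 is the tatweel and 0xC8 is beh.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida)

# Hypothetical legacy bytes: beh, tatweel, tatweel, beh.
raw = bytes([0xC8, 0xE0, 0xE0, 0xC8])

# Decode to Unicode first, then drop the tatweels.
text = raw.decode("iso8859_6")
cleaned = text.replace(TATWEEL, "")

print(len(text), len(cleaned))
```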
-- 
	Mike Maxwell
	maxwell at umiacs.umd.edu
	"My definition of an interesting universe is
	one that has the capacity to study itself."
         --Stephen Eastmond
