[Corpora-List] Pashto (was: Which Statistical Test is Suitable)

Sat Jul 16 05:05:25 UTC 2011

> You're referring to the two Unicode characters Arabic Kaaf (U+0643) and Arabic Letter Keheh (U+06A9, also commonly called Kaf or Kaaf), right?

I am referring to variations such as آئينـــــــــه and آئينه. This is a single word (meaning “mirror”) written in two styles: in the first form, the second-to-last grapheme is elongated. Similarly, “kaaf”, “baa”, “meem” and many other graphemes are sometimes written longer and sometimes shorter. To software, the two forms are different words.
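A minimal sketch of how such elongated forms could be collapsed before counting words, assuming the elongation is encoded with U+0640 ARABIC TATWEEL (the usual way of stretching a letter in Arabic-script text). The function name and the code points used to spell the example word are illustrative assumptions, not taken from the extraction system discussed here.

```python
# Assumption: elongation is encoded as U+0640 ARABIC TATWEEL (kashida).
TATWEEL = "\u0640"


def strip_tatweel(word: str) -> str:
    """Return the word with every tatweel character removed."""
    return word.replace(TATWEEL, "")


# The "mirror" example, spelled with explicit code points (assumed encoding):
# ALEF WITH MADDA, YEH WITH HAMZA, YEH, NOON, [tatweels], HEH.
plain = "\u0622\u0626\u064A\u0646\u0647"
elongated = "\u0622\u0626\u064A\u0646" + TATWEEL * 9 + "\u0647"

# The raw strings differ, but they agree once tatweel is stripped.
print(elongated == plain)
print(strip_tatweel(elongated) == plain)
```

Tatweel carries no lexical information, so removing it should not merge words that are actually different; whether this holds for a given corpus would still need checking against the data.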

> I would guess you've also observed lots of variation in the various yehs, right?  Arabic yeh, Farsi yeh, yeh with tail,...

Yes. The same word is often written with different “yehs”. The data I extracted contain frequent examples of this variation, e.g. آبادى and آبادي, which mean “population”. Both are variants of a single word.
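As a rough sketch, candidate yeh variants could be grouped by folding the yeh-like characters to a single code point and collecting words that share the folded key. The mapping below (U+0649 and U+06CC folded to U+064A) and the helper names are assumptions made for illustration. Because the Pashto yehs can be distinct letters, the folded key should be treated only as a way to find candidate pairs for manual review, not as a corrected spelling.

```python
from collections import defaultdict

# Assumed folding table: two yeh-like characters mapped to ARABIC YEH.
YEH_FOLD = str.maketrans({
    "\u0649": "\u064A",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER YEH
    "\u06CC": "\u064A",  # ARABIC LETTER FARSI YEH    -> ARABIC LETTER YEH
})


def variant_key(word: str) -> str:
    """Fold yeh variants so that differently spelled forms share one key."""
    return word.translate(YEH_FOLD)


def group_variants(words):
    """Map each folded key to the distinct spellings found for it.

    Only keys with more than one spelling are returned.
    """
    groups = defaultdict(set)
    for word in words:
        groups[variant_key(word)].add(word)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


# The "population" example, spelled with explicit code points (assumed encoding).
with_alef_maksura = "\u0622\u0628\u0627\u062F\u0649"
with_arabic_yeh = "\u0622\u0628\u0627\u062F\u064A"
unrelated = "\u06A9\u0648\u0631"

candidates = group_variants([with_alef_maksura, with_arabic_yeh, unrelated])
for key, forms in candidates.items():
    print(key, forms)
```

Running this over a word list extracted from a corpus would give a list of spelling pairs to inspect by hand.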

> Do you know of any corpora that deal with Pashto spelling variation? For instance, a bitext with found spellings aligned with "correct" spellings.  

To my knowledge, there is no Pashto corpus that deals with Pashto spelling variation. My Ph.D. supervisor and I have been working on Pashto corpora since 2006. I used a corpus of 1.225 million words of Pashto text, developed by Mohammad Abid Khan (my Ph.D. supervisor) and me (work on this corpus was presented at Corpus Linguistics 2009). It is not, however, an aligned corpus. I extracted the words from the corpus and then observed a lot of spelling variation.

Regards.

Fatima Tuz Zuhra
Ph.D. Scholar and Lecturer,
Department of Computer Science,
University of Peshawar, Pakistan.

--- On Thu, 7/14/11, Mike Maxwell <maxwell at umiacs.umd.edu> wrote:

From: Mike Maxwell <maxwell at umiacs.umd.edu>
Subject: Pashto (was: Which Statistical Test is Suitable)
To: "fatima zuhra" <fateeshah at yahoo.com>
Cc: corpora at uib.no
Date: Thursday, July 14, 2011, 9:04 AM

On 7/13/2011 11:40 PM, fatima zuhra wrote:
> One of my works was concerned with extracting individual words from a
> written Pashto corpus. The system I used for extracting individual
> Pashto words gave me such variations of the same word that looked the
> same at the first glance (e.g. the grapheme "kaaf" may be written a bit
> longer than how it is written currently in the Urdu spelling of "Shakir"
> in your name, which will result in a variation of this spelling). Are
> you considering these variations or some others?

You're referring to the two Unicode characters Arabic Kaaf (U+0643) and Arabic Letter Keheh (U+06A9, also commonly called Kaf or Kaaf), right?

I would guess you've also observed lots of variation in the various yehs, right?  Arabic yeh, Farsi yeh, yeh with tail,...

Do you know of any corpora that deal with Pashto spelling variation? For instance, a bitext with found spellings aligned with "correct" spellings.  I'm not sure what "correct" spelling would mean in this context, but perhaps the spelling according to some dictionary (of course allowing for the various inflected forms of words).
--     Mike Maxwell
    maxwell at umiacs.umd.edu
    "My definition of an interesting universe is
    one that has the capacity to study itself."
        --Stephen Eastmond