[Corpora-List] Pashto (was: Which Statistical Test is Suitable)

Mike Maxwell maxwell at umiacs.umd.edu
Thu Jul 14 04:04:21 UTC 2011


On 7/13/2011 11:40 PM, fatima zuhra wrote:
> One of my works was concerned with extracting individual words from a
> written Pashto corpus. The system I used for extracting individual
> Pashto words gave me such variations of the same word that looked the
> same at the first glance (e.g. the grapheme "kaaf" may be written a bit
> longer than how it is written currently in the Urdu spelling of "Shakir"
> in your name, which will result in a variation of this spelling). Are
> you considering these variations or some others?

You're referring to the two Unicode characters Arabic Letter Kaf
(U+0643) and Arabic Letter Keheh (U+06A9, also commonly called Kaf or
Kaaf), right?

I would guess you've also observed lots of variation in the various
yehs, right?  Arabic yeh (U+064A), Farsi yeh (U+06CC), yeh with tail
(U+06CD),...
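
To make the kind of variation I mean concrete, here is a rough Python sketch that groups word forms differing only in which kaf or yeh code point was typed. The folding table and the names FOLD, fold_key, group_variants and describe are my own assumptions for illustration, not an established Pashto normalization. Several of these yehs are distinct letters in Pashto orthography, so the folded form should only be used as a grouping key for finding candidate variants, never as a corrected spelling.

```python
import unicodedata
from collections import defaultdict

# Hypothetical folding table. The choice of target code points is an
# assumption made for this sketch, not a statement about correct Pashto.
FOLD = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF           -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH           -> ARABIC LETTER FARSI YEH
    "\u06CD": "\u06CC",  # ARABIC LETTER YEH WITH TAIL -> ARABIC LETTER FARSI YEH
}
_TABLE = str.maketrans(FOLD)


def fold_key(word):
    """Return a lossy key under which look-alike spellings collide."""
    return unicodedata.normalize("NFC", word).translate(_TABLE)


def group_variants(words):
    """Map each folded key to the set of distinct spellings found for it.

    Only keys with more than one spelling are returned.
    """
    groups = defaultdict(set)
    for w in words:
        groups[fold_key(w)].add(w)
    return {k: v for k, v in groups.items() if len(v) > 1}


def describe(word):
    """List the code point and Unicode name of each character in a word."""
    return [f"U+{ord(c):04X} {unicodedata.name(c, '<unnamed>')}" for c in word]
```

Running describe() on two tokens that look the same on screen shows which code points actually differ, and group_variants() over a word list gives a first cut at the sets of spellings one might want to align.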

Do you know of any corpora that deal with Pashto spelling variation? For 
instance, a bitext with found spellings aligned with "correct" 
spellings.  I'm not sure what "correct" spelling would mean in this 
context, but perhaps the spelling according to some dictionary (of 
course allowing for the various inflected forms of words).
-- 
	Mike Maxwell
	maxwell at umiacs.umd.edu
	"My definition of an interesting universe is
	one that has the capacity to study itself."
	--Stephen Eastmond



