[Corpora-List] Pashto (was: Which Statistical Test is Suitable)
Mike Maxwell
maxwell at umiacs.umd.edu
Thu Jul 14 04:04:21 UTC 2011
On 7/13/2011 11:40 PM, fatima zuhra wrote:
> One of my works was concerned with extracting individual words from a
> written Pashto corpus. The system I used for extracting individual
> Pashto words gave me such variations of the same word that looked the
> same at the first glance (e.g. the grapheme "kaaf" may be written a bit
> longer than how it is written currently in the Urdu spelling of "Shakir"
> in your name, which will result in a variation of this spelling). Are
> you considering these variations or some others?
You're referring to the two Unicode characters Arabic Kaaf (U+0643) and
Arabic Letter Keheh (U+06A9, also commonly called Kaf or Kaaf), right?
I would guess you've also observed lots of variation in the various
yehs, right? Arabic yeh (U+064A), Farsi yeh (U+06CC), yeh with tail
(U+06CD), ...
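As a rough illustration of how such variants could be grouped when pulling
word forms out of a corpus, here is a minimal Python sketch. The folding
table, and the names fold_key and group_variants, are my own assumptions for
the example. In particular, folding the yehs together is lossy for Pashto,
where they can be distinct letters, so the folded form is only a grouping key
for finding candidate variants, not a "corrected" spelling.

```python
from collections import defaultdict

# Assumed folding table, used only to build a grouping key.
# It is NOT a claim about correct Pashto orthography.
FOLD = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u06CD": "\u06CC",  # ARABIC LETTER YEH WITH TAIL -> ARABIC LETTER FARSI YEH
}
_TABLE = str.maketrans(FOLD)


def fold_key(token: str) -> str:
    """Return the token with the look-alike characters above folded together."""
    return token.translate(_TABLE)


def group_variants(tokens):
    """Map each folded key to the distinct surface spellings that share it.

    Only keys with more than one surface spelling are returned, since those
    are the candidate spelling variants.
    """
    groups = defaultdict(set)
    for token in tokens:
        groups[fold_key(token)].add(token)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}
```

Running group_variants over the extracted word list would give, for each
folded key, the set of spellings that differ only in these characters; each
such set would still need to be checked by someone who reads Pashto.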
Do you know of any corpora that deal with Pashto spelling variation? For
instance, a bitext with found spellings aligned with "correct"
spellings. I'm not sure what "correct" spelling would mean in this
context, but perhaps the spelling according to some dictionary (of
course allowing for the various inflected forms of words).
--
Mike Maxwell
maxwell at umiacs.umd.edu
"My definition of an interesting universe is
one that has the capacity to study itself."
--Stephen Eastmond