[Corpora-List] Part of Speech annotation of Persian and Urdu corpora

Ben Allison B.Allison at dcs.shef.ac.uk
Wed Feb 27 11:44:36 UTC 2008


Bushra,

I'm not sure whether you want human-annotated text from which to induce 
a tagger, or are interested in having a working POS tagger itself. If 
the latter, then about a year ago we tracked down a 10 million word 
corpus of Persian which had been hand-annotated, and induced a tagger 
from the 1 million word part that the creators were prepared to give 
away for research purposes. The tagset they used (which they created for 
the job) could be interpreted on two levels -- there was a coarse tagset 
of 14 tags with categories like Noun, Verb, etc. and a much finer one 
which I believe ran to about 150 tags. Accuracies were pretty good -- 
over 98% for coarse tags, and around 92% for the fine ones.

I'm not sure if you're prepared for a DIY approach, but I suspect that 
if you are, you could get hold of the corpus we used (I can pass you 
contact information) and use one of many trainable taggers to induce 
your own. Of course, this might not be what you were thinking of...
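
For the DIY route, here is a rough sketch of what inducing a tagger from hand-annotated text and scoring it at both tagset levels might look like. It is not the tagger we used, only a most-frequent-tag baseline in plain Python. The token names (w1, w2, w3), the tag names (N_SING, N_PL, V_PA) and the convention that the coarse tag is the part before the underscore are all made up for illustration; the real corpus will have its own format and tagset.

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sentences):
    """Learn each word's most frequent tag, plus a fallback tag for unseen words."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
            all_tags[tag] += 1
    default_tag = all_tags.most_common(1)[0][0]
    model = {word: c.most_common(1)[0][0] for word, c in counts.items()}
    return model, default_tag

def tag_words(model, default_tag, words):
    """Tag a list of words, falling back to default_tag for unknown words."""
    return [(w, model.get(w, default_tag)) for w in words]

def coarsen(fine_tag):
    # Assumption (hypothetical tagset): fine tags look like "N_SING",
    # and the coarse tag is everything before the first underscore.
    return fine_tag.split("_", 1)[0]

def accuracy(model, default_tag, gold_sentences, mapping=lambda t: t):
    """Per-token accuracy; pass mapping=coarsen to score at the coarse level."""
    correct = total = 0
    for sentence in gold_sentences:
        words = [w for w, _ in sentence]
        predicted = tag_words(model, default_tag, words)
        for (_, gold), (_, pred) in zip(sentence, predicted):
            correct += mapping(gold) == mapping(pred)
            total += 1
    return correct / total

# Toy data, invented purely to exercise the code (not from any real corpus).
toy_train = [
    [("w1", "N_SING"), ("w2", "V_PA")],
    [("w1", "N_SING"), ("w3", "N_PL"), ("w2", "V_PA")],
]
toy_test = [[("w1", "N_PL"), ("w2", "V_PA")]]

model, default_tag = train_unigram_tagger(toy_train)
fine_acc = accuracy(model, default_tag, toy_test)
coarse_acc = accuracy(model, default_tag, toy_test, mapping=coarsen)
```

A fine-grained mistake (N_SING for N_PL) still counts as correct at the coarse level, which is one reason coarse accuracy comes out higher than fine accuracy. A proper trainable tagger would use context as well as word identity, so treat this only as a starting point.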

Ben

hfaili at ece.ut.ac.ir wrote:
> Dear Bushra,
> I work at an Iranian company, Douran (www.douran.com), which has good
> experience and tools for POS tagging and other NLP fields
> in Persian...
> For more information, contact me via hfaili at douran.com
> Regards
>
> hello
> I was wondering if anybody knows of any companies or individual linguists
> who would do Part of Speech annotation of Persian and Urdu corpora?
>
> Thank you
> Bushra Zawaydeh
>
> ********************************************************************
> Bushra Zawaydeh                           bushraz at basistech.com
> Senior Linguist
> Basis Technology                           Tel: (617)386-7130
> One Alewife Center                         Fax: (617)386-2020
> Cambridge, MA 02140-2327
> USA
> **********************************************************************
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>   
