[Corpora-List] Corpus Development

fatima zuhra fateeshah at yahoo.com
Sun Apr 27 04:56:13 UTC 2008


Dear Hardie,
   
  Thanks for your e-mail and its valuable suggestions. I will indeed act on your advice to expand the corpus. I have been working with Xaira for a few days and have found it a very useful tool.
   
  Sir, I would like to ask what factors led you to prefer, e.g., SQL for larger corpora, i.e. in the case of Urdu, Nepali etc. Isn't XML better for larger corpora? If not, why?
   
  Regards.
   
  


"Hardie, Andrew" <a.hardie at lancaster.ac.uk> wrote:
      Dear Fatima,
   
  I am sure others will have responded to your queries, but I thought I'd add my voice. For the kind of data you describe, Xaira is indeed a good option. The web addresses you need are:
   
  http://www.oucs.ox.ac.uk/rts/xaira/
http://www.natcorp.ox.ac.uk/tools/
http://sourceforge.net/projects/xaira/
http://xaira.sourceforge.net/
   
  However, when you have a larger corpus, you might also consider whether a web-accessible solution (e.g. one based on an SQL database) would be more convenient. I have found this to be the case when working with corpora of Urdu, Nepali, Sinhala etc.
   
  In terms of your future research, I would recommend working primarily on expanding your corpus. 30,000 words is not a lot of data in corpus terms. You will find, I think, that effort spent enhancing your corpus collection will be much more fruitful than developing software, especially given how much ready-made corpus analysis software is freely available.
   
  best regards,
   
  Andrew Hardie.
   
        Andrew Hardie
  Linguistics & English Language
  Bowland College
  Lancaster University
  Lancaster LA1 4YT
  United Kingdom
   
  www.ling.lancs.ac.uk/staff/hardie




    
---------------------------------
  From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of fatima zuhra
Sent: 19 April 2008 03:25
To: Corpora at uib.no
Subject: [Corpora-List] Corpus Development


  
  Hi All,
   
  Thanks a lot to all who paid attention to my message and offered their valuable suggestions.
   
  Dear Laxmi, my corpus is a general-purpose corpus of written Pashto. Dear Mr. Adam, the corpus currently contains 30,000 words and its size is increasing. I haven't used Xaira, but I am interested in using it. Dear Lou, I would be very thankful if you could help me further by forwarding some guidelines on Xaira. The web page http://www.xaira.net/ cannot be displayed in my browser.
   
  Dear Gee Raza, I am also glad to see someone from Pakistan on the list. I know only the three languages you mentioned, but I am interested in learning Arabic and Persian. I hope to learn these two soon.
   
  Dear Oliver, I meant to ask whether I am going in the right direction for a general-purpose Pashto corpus. By fully functional, I mean something that can rightly be called a corpus. I also wanted to investigate the appropriate statistical measures that can be used to evaluate newly developed software. In our country, there are statisticians who know each and every statistical measure but cannot advise us on which one to use for which purpose. If there are some who can guide us, we do not have access to them.
   
  Thanks to Sir Ramesh for his encouragement and valuable suggestions.
   
  I have also developed a finite state morphological analyzer for Pashto. I will provide the details from time to time. 
   
  Regards.
    
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

