25.592, Software: Computational Linguistics, Text/Corpus Linguistics: International Corpus of Arabic

linguist at linguistlist.org
Wed Feb 5 19:18:09 UTC 2014


LINGUIST List: Vol-25-592. Wed Feb 05 2014. ISSN: 1069 - 4875.

Subject: 25.592, Software: Computational Linguistics, Text/Corpus Linguistics: International Corpus of Arabic

Moderator: Damir Cavar, Eastern Michigan U <damir at linguistlist.org>

Reviews: 
Monica Macaulay, U of Wisconsin Madison
Rajiv Rao, U of Wisconsin Madison
Joseph Salmons, U of Wisconsin Madison
Mateja Schuck, U of Wisconsin Madison
Anja Wanner, U of Wisconsin Madison
       <reviews at linguistlist.org>

Homepage: http://linguistlist.org

Editor for this issue: Andrew Lamont <alamont at linguistlist.org>
================================================================  


Date: Wed, 05 Feb 2014 14:17:40
From: Sameh Alansary [sameh.alansary at bibalex.org]
Subject: Computational Linguistics, Text/Corpus Linguistics: International Corpus of Arabic

I would like to announce the release of the first phase of the International Corpus of Arabic (ICA). The ICA aims to build a representative corpus of the Arabic language as it is used throughout the Arab world, in order to support linguistic research in general and Arabic computational linguistics in particular. The ICA is planned to contain 100 million words; almost 80% of the corpus has been completed so far. As far as Modern Standard Arabic (MSA) is concerned, the ICA makes possible a systematic investigation of national varieties within the Arabic-speaking community. We hope that it will prove very useful for linguists who believe that their theories and descriptions of language should be based on real, rather than contrived, data.

In collecting the ICA, the main focus was on covering the same genres across different sources from all around the Arab world. Hence, the ICA covers numerous sources (newspapers, web articles, books, etc.) and numerous genres (literature, politics, science, arts, sports, etc.).

At this stage, the corpus includes a morphological analysis of each word. The analysis lists several pieces of information, such as prefix(es), suffix(es), word class, stem, lemma, root, and stem pattern, as well as number, gender, and definiteness, according to the context in which the word occurs in the corpus. All of this information can be used when searching the corpus.
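
To make the list of annotation fields concrete, here is a minimal Python sketch of what one analysed token and a search over such tokens might look like. It is only an illustration: the class name MorphAnalysis, the field names, the search() helper, and the two sample tokens are all assumptions made for this example, and they do not describe the ICA's actual data format or search interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MorphAnalysis:
    """Hypothetical record for one analysed token (field names are assumptions)."""
    surface: str
    prefixes: List[str] = field(default_factory=list)
    suffixes: List[str] = field(default_factory=list)
    word_class: str = ""
    stem: str = ""
    lemma: str = ""
    root: str = ""
    stem_pattern: str = ""
    number: Optional[str] = None
    gender: Optional[str] = None
    definiteness: Optional[str] = None


def search(tokens, **criteria):
    """Return the tokens whose attributes equal every given criterion."""
    return [
        t for t in tokens
        if all(getattr(t, name) == value for name, value in criteria.items())
    ]


# Illustrative sample tokens, not taken from the corpus.
tokens = [
    # al-kitab, "the book": definite singular masculine noun
    MorphAnalysis(
        surface="الكتاب", prefixes=["ال"], word_class="noun",
        stem="كتاب", lemma="كتاب", root="كتب", stem_pattern="فعال",
        number="singular", gender="masculine", definiteness="definite",
    ),
    # kataba, "he wrote": verb sharing the same root
    MorphAnalysis(
        surface="كتب", word_class="verb",
        stem="كتب", lemma="كتب", root="كتب", stem_pattern="فعل",
        number="singular", gender="masculine",
    ),
]

# Find every noun derived from the root k-t-b.
hits = search(tokens, root="كتب", word_class="noun")
print([t.surface for t in hits])
```

Storing each feature as its own field is what allows a query to combine criteria, for example a root together with a word class, rather than matching on the surface form alone.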

I would like to invite all of you who are interested to use the corpus at www.bibalex.org/ica. We also welcome your comments and suggestions for improvement, which can be sent to the corpus's official email address: ica at bibalex.org

Best Regards,
Sameh Alansary, Professor,
Arabic Computational Linguistic Center, Director,
Bibliotheca Alexandrina,
P.O.Box 138,
21526 El Shatby, Alexandria,
Tel: +20-3-4839999 Ext. 2788
Fax: +20-3-4820405
E-mail: sameh.alansary at bibalex.org

Linguistic Field(s): Computational Linguistics
                     Text/Corpus Linguistics

Subject Language(s): Arabic, Standard (arb)


