Arabic-L:LING:Trial Release of International Corpus of Arabic

Dilworth Parkinson dilworthparkinson at GMAIL.COM
Wed Feb 5 18:45:05 UTC 2014


------------------------------------------------------------------------
Arabic-L: Wed 05 Feb 2014
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
           unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject: Trial Release of International Corpus of Arabic

-------------------------Messages-----------------------------------
1)
Date: 05 Feb 2014
From: Sameh Alansary <Sameh.Alansary at bibalex.org>
Subject: Trial Release of International Corpus of Arabic

Dear Arabic List,

I would like to announce the first phase of the trial release of the
International Corpus of Arabic (ICA). The ICA aims to be a representative
corpus of the Arabic language as it is used all over the Arab world, in
order to support research in linguistics in general and in Arabic
computational linguistics in particular. The ICA is planned to contain 100
million words; almost 80% of the corpus has been completed so far. The ICA
may represent a systematic investigation of national varieties within the
Arabic-speaking community, as far as MSA is concerned. We hope that it will
prove very useful for linguists who believe that their theories and
descriptions of language should be based on real, rather than contrived,
data.

In collecting the ICA, the main focus was to cover the same genres from
different sources and from all around the Arab world. Hence, the ICA covers
numerous sources (newspapers, web articles, books, etc.) and numerous
genres (Literature, Politics, Science, Arts, Sports, etc.).


Currently, this stage includes the morphological analysis of each word
within the corpus. The analysis lists several pieces of information, such
as Prefix(es), Suffix(es), Word Class, Stem, Lemma, Root and Stem Pattern,
as well as Number, Gender and Definiteness, according to the context of
each word within the corpus. All of this information will be usable in
searches.
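
As a rough illustration only, the per-word analysis described above could
be pictured as a record with one field per feature, and a search as a
filter over those fields. The sketch below is in Python; the class name
MorphAnalysis, the function search, the field names and the two sample
words are all hypothetical and are not taken from the ICA or its search
interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MorphAnalysis:
    """Hypothetical record holding the features listed in the announcement."""
    surface: str                 # the word as it appears in the text
    prefixes: List[str]
    suffixes: List[str]
    word_class: str
    stem: str
    lemma: str
    root: str
    stem_pattern: str
    number: Optional[str] = None
    gender: Optional[str] = None
    definiteness: Optional[str] = None


def search(tokens: List[MorphAnalysis], **criteria) -> List[MorphAnalysis]:
    """Return the tokens whose fields equal every given criterion."""
    return [
        t for t in tokens
        if all(getattr(t, field) == value for field, value in criteria.items())
    ]


# Illustrative values only; not drawn from the corpus itself.
SAMPLE = [
    MorphAnalysis(
        surface="والكتاب", prefixes=["و", "ال"], suffixes=[],
        word_class="noun", stem="كتاب", lemma="كتاب", root="كتب",
        stem_pattern="فعال", number="singular", gender="masculine",
        definiteness="definite",
    ),
    MorphAnalysis(
        surface="كتبوا", prefixes=[], suffixes=["وا"],
        word_class="verb", stem="كتب", lemma="كتب", root="كتب",
        stem_pattern="فعل", number="plural", gender="masculine",
    ),
]

same_root = search(SAMPLE, root="كتب")
definite_nouns = search(SAMPLE, word_class="noun", definiteness="definite")
```

Under these assumptions, a query by root returns both sample words, while
a query combining word class and definiteness returns only the noun.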

I would like to invite all of you who are interested to use the corpus at
http://www.bibalex.org/ica, and of course we welcome your comments and
suggestions for improvement, sent to the corpus's official email address,
ica at bibalex.org.

Best Regards,
Sameh Alansary, Professor,
Arabic Computational Linguistic Center, Director,
Bibliotheca Alexandrina,
P.O.Box 138,
21526 El Shatby, Alexandria,
Tel: +20-3-4839999 Ext. 2788
Fax: +20-3-4820405
E-mail: sameh.alansary at bibalex.org

--------------------------------------------------------------------------
End of Arabic-L: 05 Feb 2014