<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; color: rgb(0, 0, 0); font-size: 14px; font-family: Calibri, sans-serif; "><div><div style="font-family: Consolas; font-size: medium; "><div style="font-family: Calibri, sans-serif; font-size: 14px; "><span style="background-color: rgb(255, 255, 255); font-family: Times; font-size: medium; ">----------- </span><span style="font-family: Times; font-size: medium; background-color: rgb(255, 255, 255); ">KALIMAT a Multipurpose Arabic Corpus </span><span style="background-color: rgb(255, 255, 255); font-family: Times; font-size: medium; ">-----------</span></div><div style="font-family: Calibri, sans-serif; font-size: 14px; "><span style="font-family: Times; font-size: medium; background-color: rgb(255, 255, 255); "><br></span></div><div style="font-family: Calibri, sans-serif; font-size: 14px; "><span style="font-family: Times; font-size: medium; background-color: rgb(255, 255, 255); ">We are pleased to announce the immediate availability of KALIMAT 1.0.</span></div><div style="font-family: Calibri, sans-serif; font-size: 14px; "><span style="font-family: Times; font-size: medium; background-color: rgb(255, 255, 255); "><br></span></div><div style="font-family: Calibri, sans-serif; font-size: 14px; "><div>KALIMAT (transliteration of "Words" in Arabic) is an Arabic natural language resource that consists of: </div><div>1) 20,291 Arabic articles (18,167,183 words) </div><div>collected from the Omani newspaper Alwatan by Abbas et al. (2011). </div><div>2) 20,291 Extractive Single-document system summaries. </div><div>3) 2,057 Extractive Multi-document system summaries. </div><div>4) 20,291 Named Entity Recognised articles. </div><div>5) 20,291 Part of Speech Tagged articles. 
</div><div>6) 20,291 Morphologically Analysed articles.</div><div><br></div><div>The data collection articles fall into six categories: </div><div>culture, economy, local-news, international-news, religion, and sports.</div></div><div style="font-family: Calibri, sans-serif; font-size: 14px; "><br></div><div style="font-family: Calibri, sans-serif; font-size: 14px; "><div><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><span lang="EN-US">The process of creating KALIMAT was applied to the entire data collection (20,291 articles).</span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><span style="font-family: Calibri, sans-serif; font-size: 14px; ">Firstly, we summarised the document collection using two Arabic summarisers, Gen-Summ and Arabic </span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><span style="font-family: Calibri, sans-serif; font-size: 14px; ">Cluster-based. </span><span style="font-family: Calibri, sans-serif; font-size: 14px; ">Gen-Summ (El-Haj et al. 2010) is a single-document summariser based on the vector space model (VSM) </span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><span style="font-family: Calibri, sans-serif; font-size: 14px; ">(Salton et al. 1975) that takes an </span><span style="font-family: Calibri, sans-serif; font-size: 14px; ">Arabic document and its first sentence and returns an extractive </span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><span style="font-family: Calibri, sans-serif; font-size: 14px; ">summary. In total, 20,291 system summaries were generated. 
</span><span style="font-family: Calibri, sans-serif; font-size: 14px; ">Cluster-based (El-Haj et al. 2011) </span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><span style="font-family: Calibri, sans-serif; font-size: 14px; ">is a multi-document summariser that treats all documents to be summarised as a single bag of sentences. </span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><span style="font-family: Calibri, sans-serif; font-size: 14px; ">The sentences of all the documents are clustered using different numbers of clusters. </span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><span style="font-family: Calibri, sans-serif; font-size: 14px; ">A summary is created by selecting sentences from the biggest cluster only (if two clusters are equally large we select the </span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><span style="font-family: Calibri, sans-serif; font-size: 14px; ">first one). 
</span><span style="font-family: Calibri, sans-serif; font-size: 14px; ">We generated 2,057 multi-document extractive system summaries, with one summary for </span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><span style="font-family: Calibri, sans-serif; font-size: 14px; ">every 10, 100 and 500 articles in each category, </span><span style="font-family: Calibri, sans-serif; font-size: 14px; ">in addition to a summary for all the articles in each category.</span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><br></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; ">Secondly, we used an Arabic Named Entity Recognition (ANER) system (Koulali and Meziane 2012) </p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; ">to annotate the data collection. </p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; ">The annotation follows the Computational Natural Language Learning (CoNLL) 2002 </p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; ">and 2003 shared tasks, whose tags fall into one of the following four categories: </p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; ">•Person Names: محمود درويش (Mahmoud Darwish). 
</p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; ">•Location Names: المغرب (Morocco).</p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; ">•Organisation Names: الأمم المتحدة (United Nations).</p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; ">•Miscellaneous Names: NEs that do not belong to any of the previous classes, including dates, times, numbers,</p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; ">monetary expressions, measurement expressions and percentages. <span style="font-family: Calibri, sans-serif; font-size: 14px; ">The ANER system was trained using ANERCorpus </span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><span style="font-family: Calibri, sans-serif; font-size: 14px; ">(Benajiba et al. 2007), a manually annotated corpus following the CoNLL shared task. 
We chose ANERCorpus</span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><span style="font-family: Calibri, sans-serif; font-size: 14px; ">to train our system because its articles were drawn from Arabic newswires and Arabic</span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><span style="font-family: Calibri, sans-serif; font-size: 14px; ">Wikipedia, which are quite close to Alwatan’s data collection.</span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><br></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; ">Thirdly, we used the Stanford POS Tagger (Toutanova et al. 2003) to annotate the 20,291-document collection. </p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; ">The model for Arabic, based on maximum entropy, was trained on the Arabic Treebank (ATB) p1-3 corpus </p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; ">using the augmented Bies mapping of ATB tags. <span style="font-family: Calibri, sans-serif; font-size: 14px; ">The tagger identifies 33 parts of speech, using the Penn</span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><span style="font-family: Calibri, sans-serif; font-size: 14px; ">Treebank project codification, such as: Noun (NN), Plural Noun (NNS), Proper Noun (NNP), Verb (VB), Adjective (JJ). 
</span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><span style="font-family: Calibri, sans-serif; font-size: 14px; ">The tagger reached an accuracy of 96.50%. </span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><br></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; ">Finally, we applied morphological analysis to the data collection using the Alkhalil morphological</p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><span style="font-family: Calibri, sans-serif; font-size: 14px; ">analyser (Mazroui et al. 2011). </span><span style="font-family: Calibri, sans-serif; font-size: 14px; ">The analysis was carried out in the following steps: pre-processing (removal of diacritics) </span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><span style="font-family: Calibri, sans-serif; font-size: 14px; ">and segmentation (each word is considered as [proclitic + stem + enclitic]).</span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; ">Applying the Alkhalil analyser to the data collection, we reached an accuracy of 96%. 
</p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><br></p><p></p><div><hr align="center" size="3" width="95%"></div><div><br></div><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><span style="font-family: Calibri, sans-serif; font-size: 14px; ">We provide KALIMAT for free, including the articles, annotated text, </span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><span style="font-family: Calibri, sans-serif; font-size: 14px; ">entities and summaries, to help </span><span style="font-family: Calibri, sans-serif; font-size: 14px; ">advance work on Arabic NLP. </span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><br></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; ">The corpus can be downloaded directly from:</p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><a href="http://bit.ly/16jO3Ks" style="color: purple; ">http://bit.ly/16jO3Ks</a> </p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; ">[<span style="font-family: Calibri, sans-serif; font-size: 14px; "><a href="https://sourceforge.net/projects/kalimat" style="font-family: 'Times New Roman'; color: purple; ">https://sourceforge.net/projects/kalimat</a>/.]</span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><span style="font-family: 
Calibri, sans-serif; font-size: 14px; ">[</span><a href="http://www.lancs.ac.uk/staff/elhaj/corpora.htm" style="font-family: Calibri, sans-serif; color: purple; font-size: 14px; ">http://www.lancs.ac.uk/staff/elhaj/corpora.htm</a><span style="font-family: Calibri, sans-serif; font-size: 14px; ">]</span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><span style="font-family: Calibri, sans-serif; font-size: 14px; "><br></span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; ">The work will be presented at the <span style="font-family: Calibri, sans-serif; font-size: 14px; ">Second Workshop on Arabic Corpus Linguistics </span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><span style="font-family: Calibri, sans-serif; font-size: 14px; ">(WACL-2), </span><span style="font-family: Calibri, sans-serif; font-size: 14px; ">held in conjunction with the Corpus Linguistics 2013 conference</span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><span style="font-family: Calibri, sans-serif; font-size: 14px; ">Monday 22nd July 2013 – Lancaster University, UK</span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><a href="http://ucrel.lancs.ac.uk/cl2013/wacl2.php" style="color: purple; ">http://ucrel.lancs.ac.uk/cl2013/wacl2.php</a></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; 
"><br></p><p></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; ">The corpus and the results we achieved can be used by researchers as </p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; ">gold standards and/or baselines to test and evaluate their Arabic tools. </p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; ">We also welcome any amendments to the corpus by other researchers.</p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; ">In our work we address the shortage of relevant data for Arabic natural </p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; ">language processing, <span style="font-family: Calibri, sans-serif; font-size: 14px; ">taking into consideration the lack of Arab participants </span></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "><span style="font-family: Calibri, sans-serif; font-size: 14px; ">producing resources that are important for researchers working on Arabic NLP. </span></p><p></p><p class="CL2013Text" style="margin: 0cm 0cm 0.0001pt; text-align: justify; font-size: 11pt; font-family: 'Times New Roman'; "></p><div><div id="ftn1"></div></div><p></p><p class="CL2013TextIndent" style="margin: 0cm 0cm 0.0001pt; text-align: justify; text-indent: 11.35pt; font-size: 11pt; font-family: 'Times New Roman'; "><o:p></o:p></p></div><div>______________________________________________________________________</div><div><br></div><div>KALIMAT uses copyrighted material. 
Details of the terms of the</div><div>applicable copyrights are described in the file COPYRIGHT that</div><div>accompanies this resource. The source of the documents is the </div><div>Omani newspaper Alwatan <a href="http://www.alwatan.com/" style="font-family: 'Times New Roman'; color: purple; ">http://www.alwatan.com</a>.</div><div><br></div><div>KALIMAT was created by </div><div>1-Mahmoud El-Haj &lt;<a href="mailto:m.el-haj@lancaster.ac.uk" style="font-family: 'Times New Roman'; color: purple; ">m.el-haj@lancaster.ac.uk</a>&gt;</div><div><a href="http://www.lancs.ac.uk/staff/elhaj/" style="font-family: 'Times New Roman'; color: purple; ">http://www.lancs.ac.uk/staff/elhaj/</a></div><div>and </div><div>2-Rim Koulali &lt;<a href="mailto:rim.koulali@gmail.com" style="font-family: 'Times New Roman'; color: purple; ">rim.koulali@gmail.com</a>&gt;</div><div><br></div><div>1-School of Computing and Communications, Lancaster University, Lancaster, Lancashire, UK.</div><div>2-LARI Laboratory, Mohammed 1 University, Oujda, Morocco.</div></div><div style="font-family: Calibri, sans-serif; font-size: 14px; "><br></div><div style="font-family: Calibri, sans-serif; font-size: 14px; "><hr align="center" size="3" width="95%"></div><div style="font-family: Calibri, sans-serif; font-size: 14px; "><br></div><div style="font-family: Calibri, sans-serif; font-size: 14px; ">References</div><div style="font-family: Calibri, sans-serif; font-size: 14px; "><br></div><div style="font-family: Calibri, sans-serif; font-size: 14px; "><div>Abbas, M., Smaili, K. and Berkani, D. 2011. “Evaluation of Topic Identification </div><div>Methods on Arabic Corpora”. Journal of Digital Information Management, vol. </div><div>9, no. 5, pp. 185–192.</div><div><br></div><div>Al-Sulaiti, L., Atwell, ES. and Steven, E. 2006. “The design of a corpus of </div><div>Contemporary Arabic”. International Journal of Corpus Linguistics, 11(2): </div><div>135–171.</div><div><br></div><div>Benajiba, Y., Rosso, P. 
and Benedí Ruiz, J. 2007. “ANERsys: An Arabic Named Entity</div><div>Recognition System Based on Maximum Entropy”. Computational Linguistics and</div><div>Intelligent Text Processing, 143–153.</div><div><br></div><div>El-Haj, M., Kruschwitz, U. and Fox, C. 2010. “Using Mechanical Turk to Create a</div><div>Corpus of Arabic Summaries”. In The 7th International Language Resources and</div><div>Evaluation Conference (LREC 2010), pages 36–39, Valletta, Malta. LREC.</div><div><br></div><div>El-Haj, M., Kruschwitz, U. and Fox, C. 2011. “Exploring Clustering for</div><div>Multi-Document Arabic Summarisation”. In The 7th Asian Information Retrieval</div><div>Societies (AIRS 2011), volume 7097 of Lecture Notes in Computer Science, pages</div><div>550–561. Springer Berlin / Heidelberg.</div><div><br></div><div>Koulali, R. and Meziane, A. 2012. “A contribution to Arabic Named Entity</div><div>Recognition”. In ICT and Knowledge Engineering. ICT Knowledge Engineering, </div><div>pages 46–52.</div><div><br></div><div>Mazroui, A., Meziane, A., Ould Abdallahi Ould Bebah, M., Boudlal, A., Lakhouaja, </div><div>A. and Shoul, M. 2011. “Alkhalil morphosys: Morphosyntactic analysis system for </div><div>non vocalized Arabic”. In Proceedings of the 7th International Computing Conference </div><div>in Arabic.</div><div><br></div><div>Salton, G., Wong, A. and Yang, C. S. 1975. “A Vector Space Model for Automatic </div><div>Indexing”. Communications of the ACM, 18(11):613–620.</div><div><br></div><div>Toutanova, K., Klein, D., Manning, C.D. and Singer, Y. 2003. “Feature-Rich </div><div>Part-Of-Speech Tagging With a Cyclic Dependency Network”. In Proceedings </div><div>of the 2003 Conference of the North American Chapter of the Association for </div><div>Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03, </div><div>pages 173–180. 
</div></div><div style="font-family: Calibri, sans-serif; font-size: 14px; "><br></div><div style="font-family: Calibri, sans-serif; font-size: 14px; "><br></div><div style="font-family: Calibri, sans-serif; font-size: 14px; "><hr align="center" size="3" width="95%"></div><div style="font-family: Calibri, sans-serif; font-size: 14px; ">Best,</div><div style="font-family: Calibri, sans-serif; font-size: 14px; "><br></div><div style="font-family: Calibri, sans-serif; font-size: 14px; ">Dr. Mahmoud El-Haj</div><div style="font-family: Calibri, sans-serif; font-size: 14px; ">Research Associate</div><div style="font-family: Calibri, sans-serif; font-size: 14px; ">School of Computing and Communications</div><div style="font-family: Calibri, sans-serif; font-size: 14px; ">InfoLab21, Lancaster University</div><div style="font-family: Calibri, sans-serif; font-size: 14px; ">Lancaster, Lancashire, UK</div><div><br></div></div></div></body></html>