[Corpora-List] KALIMAT a Multipurpose Arabic Corpus

Mahmoud El-Haj dr.melhaj at gmail.com
Thu May 9 15:49:01 UTC 2013


----------- KALIMAT a Multipurpose Arabic Corpus -----------

We are pleased to announce the immediate availability of KALIMAT 1.0.

KALIMAT (transliteration of "Words" in Arabic) is an Arabic natural language
resource that consists of:
1) 20,291 Arabic articles (18,167,183 words)
collected from the Omani newspaper Alwatan by Abbas et al. (2011).
2) 20,291 Extractive Single-document system summaries.
3) 2,057 Extractive Multi-document system summaries.
4) 20,291 Named Entity Recognised articles.
5) 20,291 Part of Speech Tagged articles.
6) 20,291 Morphologically Analysed articles.

The data collection articles fall into six categories:
culture, economy, local-news, international-news, religion, and sports.

The process of creating KALIMAT was applied to the entire data collection
(20,291 articles).

Firstly, we summarised the document collection using two Arabic summarisers,
Gen-Summ and Arabic Cluster-based. Gen-Summ (El-Haj et al. 2010) is a
single-document summariser based on the vector space model (VSM) (Salton et
al. 1975) that takes an Arabic document and its first sentence and returns
an extractive summary. In total, 20,291 system summaries were generated.
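To illustrate the idea of a VSM summariser driven by the first sentence, here is a minimal Python sketch. It is not the Gen-Summ implementation: the TF-IDF weighting, the regex tokeniser, the function names and the max_sentences parameter are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simple word tokenisation; \w matches Arabic letters in Python 3.
    return re.findall(r"\w+", text)


def cosine(a, b):
    # Cosine similarity between two sparse term-weight vectors (dicts).
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarise(sentences, max_sentences=2):
    # Build TF-IDF vectors, treating each sentence as one "document".
    bags = [Counter(tokenize(s)) for s in sentences]
    n = len(bags)
    df = Counter(t for bag in bags for t in bag)
    vectors = [
        {t: tf * (math.log(n / df[t]) + 1.0) for t, tf in bag.items()}
        for bag in bags
    ]
    query = vectors[0]  # assumption: the first sentence acts as the query
    # Rank the remaining sentences by similarity to the first sentence.
    ranked = sorted(range(1, n), key=lambda i: cosine(query, vectors[i]),
                    reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore document order
    return [sentences[i] for i in chosen]
```

Calling summarise(sentences) on a list of sentence strings returns the sentences most similar to the first one, in their original order.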
Cluster-based (El-Haj et al. 2011) is a multi-document summariser that
treats all documents to be summarised as a single bag of sentences. The
sentences of all the documents are clustered using different numbers of
clusters. A summary is created by selecting sentences from the biggest
cluster only (if two clusters tie for the biggest, we select the first).
We generated 2,057 multi-document extractive system summaries: one summary
for every 10, 100 and 500 articles in each category, in addition to one
summary for all the articles in each category.
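The selection step of the cluster-based approach can be sketched in Python as follows. This is only an illustration: the clustering itself is left out (any algorithm that assigns a cluster id to each sentence would do), the function names are hypothetical, and reading "first" as "smallest cluster id" is an assumption.

```python
from collections import Counter


def pool_sentences(documents):
    # Treat all documents as a single bag of sentences.
    return [sentence for doc in documents for sentence in doc]


def select_from_biggest_cluster(sentences, labels):
    # labels[i] is the cluster id assigned to sentences[i] by some
    # clustering step that is not shown here.
    sizes = Counter(labels)
    largest = max(sizes.values())
    # On a tie, keep the cluster with the smallest id (assumed meaning
    # of "the first biggest cluster").
    biggest = min(label for label, size in sizes.items() if size == largest)
    return [s for s, label in zip(sentences, labels) if label == biggest]
```

Given the pooled sentences and one cluster id per sentence, the function returns only the sentences of the largest cluster, in their pooled order.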



Secondly, we used an Arabic Named Entity Recognition (ANER) system (Koulali
and Meziane 2012) to annotate the data collection. We followed the
annotation scheme of the Computational Natural Language Learning (CoNLL)
2002 and 2003 shared tasks, in which tags fall into one of the following
four categories:

• Person names: محمود درويش (Mahmoud Darwish).

• Location names: المغرب (Morocco).

• Organisation names: الأمم المتحدة (United Nations).

• Miscellaneous names: NEs not belonging to any of the previous classes,
including dates, times, numbers, monetary expressions, measurement
expressions and percentages.

The ANER system was trained using ANERCorpus (Benajiba et al. 2007), a
manually annotated corpus following the CoNLL shared task. We chose
ANERCorpus to train our system because its articles were drawn from Arabic
newswires and Arabic Wikipedia, which are quite close to Alwatan’s data
collection.
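For readers who want to work with CoNLL-style annotations, the following Python sketch groups (token, tag) pairs into entity spans. The tag names (B-PER, I-PER, B-LOC, B-ORG, O) follow the usual CoNLL IOB convention; the exact tag set and file layout of the KALIMAT NER files are not described above, so treat them as assumptions, and the function name as hypothetical.

```python
def extract_entities(tagged_tokens):
    """Group (token, tag) pairs in IOB notation into (text, type) spans."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in tagged_tokens:
        prefix, _, etype = tag.partition("-")
        if prefix == "B" or (prefix == "I" and etype != current_type):
            # A new entity starts here; close the previous one, if any.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], etype
        elif prefix == "I":
            # Continuation of the current entity.
            current_tokens.append(token)
        else:
            # "O" tag: outside any entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities
```

For example, the tokens محمود (B-PER) and درويش (I-PER) would be returned as the single person entity محمود درويش.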



Thirdly, we used the Stanford POS Tagger (Toutanova et al. 2003) to annotate
the collection of 20,291 documents. The model for Arabic is based on maximum
entropy and was trained on the Arabic Treebank (ATB) parts 1-3 corpus using
the augmented Bies mapping of ATB tags. The tagger identifies 33 parts of
speech, using the Penn Treebank codification, such as Noun (NN), Plural Noun
(NNS), Proper Noun (NNP), Verb (VB) and Adjective (JJ). The tagger reached
an accuracy of 96.50%.
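As a small usage illustration, the Python sketch below counts tag frequencies in tagged text. It assumes each token is written as word, separator, tag (for example word/NN); the separator actually used in the KALIMAT files is not stated above, so it is a parameter here, and the function name is hypothetical.

```python
from collections import Counter


def tag_counts(tagged_text, separator="/"):
    # Count how often each POS tag occurs in whitespace-separated
    # word<separator>TAG items.
    counts = Counter()
    for item in tagged_text.split():
        word, sep, tag = item.rpartition(separator)
        if sep:  # skip items that carry no tag
            counts[tag] += 1
    return counts
```

Passing separator="_" handles text where the tag is attached with an underscore instead.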



Finally, we applied morphological analysis to the data collection using the
Alkhalil morphological analyser (Mazroui et al. 2011). The analysis was
carried out in the following steps: pre-processing (removal of diacritics)
and segmentation (each word is considered as [proclitic + stem + enclitic]).
Applying the Alkhalil analyser to the data collection, we reached an
accuracy of 96%.
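The two steps can be pictured with the toy Python sketch below. It is not the Alkhalil analyser: the clitic lists are short, hand-picked assumptions, the min_stem threshold is arbitrary, and a purely string-based split like this will mis-segment words whose stem happens to begin or end with one of these letters.

```python
import re

# Arabic short vowels, tanween, shadda and sukun (U+064B to U+0652),
# plus the superscript alef (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

# Illustrative clitic lists only, ordered longest first.
PROCLITICS = ["وال", "بال", "ال", "و", "ف", "ب", "ل"]
ENCLITICS = ["هما", "هم", "ها", "كم", "نا", "ه", "ك", "ي"]


def strip_diacritics(word):
    # Pre-processing: remove diacritics.
    return DIACRITICS.sub("", word)


def segment(word, min_stem=3):
    # Segmentation: split a word into (proclitic, stem, enclitic),
    # keeping at least min_stem characters in the stem.
    word = strip_diacritics(word)
    proclitic = enclitic = ""
    for p in PROCLITICS:
        if word.startswith(p) and len(word) - len(p) >= min_stem:
            proclitic, word = p, word[len(p):]
            break
    for e in ENCLITICS:
        if word.endswith(e) and len(word) - len(e) >= min_stem:
            enclitic, word = e, word[:-len(e)]
            break
    return proclitic, word, enclitic
```

With these lists, segment("وكتابهم") yields the proclitic و, the stem كتاب and the enclitic هم.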






We provide KALIMAT for free, including the articles, annotated text,

entities and summaries, to help advance work on Arabic NLP.



The corpus can be downloaded directly from:

http://bit.ly/16jO3Ks

[https://sourceforge.net/projects/kalimat/]

[http://www.lancs.ac.uk/staff/elhaj/corpora.htm]



The work will be presented at the Second Workshop on Arabic Corpus
Linguistics (WACL-2), held in conjunction with the Corpus Linguistics 2013
conference on Monday 22nd July 2013 at Lancaster University, UK.

http://ucrel.lancs.ac.uk/cl2013/wacl2.php



The corpus and the results we achieved can be used by researchers as
gold standards and/or baselines to test and evaluate their Arabic tools.
We also welcome any amendments to the corpus by other researchers.
In our work we address the shortage of relevant data for Arabic natural
language processing, taking into consideration the lack of Arab participants
producing resources that are important for researchers working on Arabic
NLP.

______________________________________________________________________

KALIMAT uses copyright material. Details of the terms of the
applicable copyrights are described in the file COPYRIGHT that
accompanies this resource. The source of the documents is the
Omani newspaper Alwatan (http://www.alwatan.com).

KALIMAT was created by
1-Mahmoud El-Haj <m.el-haj at lancaster.ac.uk>
http://www.lancs.ac.uk/staff/elhaj/
And 
2-Rim Koulali <rim.koulali at gmail.com>

1-School of Computing and Communications, Lancaster University, Lancaster,
Lancashire, UK.
2-LARI Laboratory, Mohammed 1 University, Oujda, Morocco.



References

Abbas, M., Smaili, K. and Berkani, D. 2011. “Evaluation of Topic
Identification Methods on Arabic Corpora”. Journal of Digital Information
Management, vol. 9, no. 5, pp. 185-192.

Al-Sulaiti, L., Atwell, ES. and Steven, E. 2006. “The design of a corpus of
Contemporary Arabic”. International Journal of Corpus Linguistics, 11(2):
135–171.

Benajiba, Y., Rosso, P. and Benedí Ruiz, J. 2007. “ANERsys: An Arabic Named
Entity Recognition System Based on Maximum Entropy”. In Computational
Linguistics and Intelligent Text Processing, pages 143–153.

El-Haj, M., Kruschwitz, U. and Fox, C. 2010. “Using Mechanical Turk to
Create a Corpus of Arabic Summaries”. In The 7th International Language
Resources and Evaluation Conference (LREC 2010), pages 36–39, Valletta,
Malta. LREC.

El-Haj, M., Kruschwitz, U. and Fox, C. 2011. “Exploring Clustering for
Multi-Document Arabic Summarisation”. In The 7th Asian Information Retrieval
Societies (AIRS 2011), volume 7097 of Lecture Notes in Computer Science,
pages 550–561. Springer Berlin / Heidelberg.

Koulali, R. and Meziane, A. 2012. “A contribution to Arabic Named Entity
Recognition”. In ICT and Knowledge Engineering. ICT Knowledge Engineering,
pages 46–52.

Mazroui, A., Meziane, A., Ould Abdallahi Ould Bebah, M., Boudlal, A.,
Lakhouaja, A. and Shoul, M. 2011. “Alkhalil Morphosys: Morphosyntactic
analysis system for non vocalized Arabic”. In Proceedings of the 7th
International Computing Conference in Arabic.

Salton, G., Wong, A. and Yang, C.S. 1975. “A Vector Space Model for
Automatic Indexing”. Communications of the ACM, 18(11):613–620.

Toutanova, K., Klein, D., Manning, C.D. and Singer, Y. 2003. “Feature-Rich
Part-Of-Speech Tagging With a Cyclic Dependency Network”. In Proceedings
of the 2003 Conference of the North American Chapter of the Association for
Computational Linguistics on Human Language Technology - Volume 1, NAACL
’03, pages 173–180.



Best,

Dr. Mahmoud El-Haj
Research Associate
School of Computing and Communications
InfoLab21, Lancaster University
Lancaster, Lancashire, UK


