[Corpora-List] Call for participation: SEMEVAL Task #18: Arabic Semantic labeling

Mona Diab mdiab at cs.columbia.edu
Fri Mar 23 22:55:04 UTC 2007


[APOLOGIES FOR DUPLICATES]



The training and test data are now ready for download from the main SEMEVAL
webpage at http://nlp.cs.swarthmore.edu/semeval/



The relevant dates are included on the webpage.



Below is a description of the task:



Tasks:



We propose several tasks for Arabic Semantic Labeling.  The tasks will span
both the WSD and Semantic Role labeling processes for this evaluation. Both
sets of tasks will be evaluated on data derived from the same data set, the
test set.



We propose 3 subtasks for WSD, all of which will have only test data for
evaluation and trial data for formatting purposes. The data will be taken from
the Arabic Treebank 3 v.2 text data and will be roughly 3000 words long:



1.       The first task is to discover different senses in the data for
nouns and verbs without associating labels with those senses. Therefore it
is a sense discrimination task.

In this task, the participants will be required to identify the number of
different senses for nouns and verbs without associating labels with those
identified senses. The assumption is that each word instance belongs to one of
the identified senses. These senses will be derived from the Arabic WordNet,
which corresponds to English WN 2.0. There will be two levels of granularity,
coarse and fine grained.



2.       The second task is to annotate all nouns and verbs in the data with
Arabic WordNet senses (provided with the test data and also accessible via
the web at http://www.globalwordnet.org/AWN).

All verbs and nouns in the data will need to be annotated with their sense
indices and/or offsets from the Arabic WordNet.



3.       The third task is to annotate all nouns and verbs in the data with
English WordNet senses.

a.       In this task, the participants will be required to link the Arabic
nouns and verbs with their corresponding sense(s) in the English WordNet 2.0

b.       An English translation corpus will be provided along with the
trial/test data

c.       A bilingual word list will also be provided



We propose 2 subtasks for Semantic Role Labeling (SRL). These subtasks will
have trial, training and test data available for them:



4.       Identifying Arguments in a sentence

In this task, the participants are required to identify all the constituents
in a constituency tree that should be annotated with argument roles related
to some predetermined verbs





5.       Automatic annotations for all arguments

In this task, the participants are required to identify and label all the
constituents in a constituency tree that should be annotated with both
numbered argument roles and ARGM roles related to some predetermined verbs
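
As an illustration of what subtasks 4 and 5 operate on, the sketch below reads
a bracketed constituency tree and lists every constituent with its token span.
These spans are the candidates a system would then accept or reject as
arguments (subtask 4) and label (subtask 5). The bracketed input format, the
placeholder tokens and the function names are assumptions made for this
example; the actual release format should be checked against the task data.

```python
# Minimal sketch: enumerate constituents of a bracketed tree as (label, start, end).
# Assumes Penn-style bracketing in which tokens never contain "(" or ")".

def parse_tree(text):
    """Parse '(LABEL child ...)' into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())      # nested constituent
            else:
                children.append(tokens[pos])  # leaf token
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return node()


def constituents(tree, start=0, out=None):
    """Return (end, spans); spans holds (label, start, end) with end exclusive."""
    if out is None:
        out = []
    label, children = tree
    end = start
    for child in children:
        if isinstance(child, str):
            end += 1                          # one leaf covers one token
        else:
            end, _ = constituents(child, end, out)
    out.append((label, start, end))
    return end, out


# Hypothetical example tree with placeholder tokens w1..w3.
example = "(S (NP (NN w1)) (VP (VB w2) (NP (NN w3))))"
length, spans = constituents(parse_tree(example))
```

Each tuple in `spans` is one candidate constituent; a participating system
would decide for each predetermined verb which of them carry a role.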



Data



The data will be Arabic Treebank 3 v.2 data, which is newswire in Modern
Standard Arabic. The data will be presented in ASCII encoding, using the
Buckwalter transliteration scheme. The data will be unvowelised and
tokenized according to the Arabic Treebank clitic tokenization scheme. We
will provide code for conversion of encoding from UTF-8 and CP1256 to the
Buckwalter transliteration scheme. Moreover, we will provide code for the
tokenization, POS tagging and Base Phrase chunking of the Arabic text. The
package can be downloaded from
http://www.cs.columbia.edu/~mdiab/ASVMTools.tar.gz.
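
For readers unfamiliar with the transliteration, the sketch below shows the
general idea of mapping Arabic text in UTF-8 or CP1256 to Buckwalter ASCII. It
is not the conversion code provided by the organizers. It covers only the
basic letters of the commonly published Buckwalter table, drops short-vowel
diacritics to mimic unvowelised text, and uses function names invented for
this example. The Treebank variant of the scheme may differ in details.

```python
# Illustrative sketch only: partial Unicode-to-Buckwalter mapping.
# Letters are written as escapes so the file itself stays ASCII.
_ARABIC = (
    "\u0621\u0622\u0623\u0624\u0625\u0626\u0627\u0628\u0629\u062A"
    "\u062B\u062C\u062D\u062E\u062F\u0630\u0631\u0632\u0633\u0634"
    "\u0635\u0636\u0637\u0638\u0639\u063A\u0641\u0642\u0643\u0644"
    "\u0645\u0646\u0647\u0648\u0649\u064A"
)
_BUCKWALTER = "'|>&<}AbptvjHxd*rzs$SDTZEgfqklmnhwYy"

# Diacritics U+064B..U+0652 are deleted (an assumption, to yield unvowelised output).
_DIACRITICS = "".join(chr(c) for c in range(0x064B, 0x0653))

# str.maketrans raises ValueError if the two strings differ in length.
_TABLE = str.maketrans(_ARABIC, _BUCKWALTER, _DIACRITICS)


def to_buckwalter(text):
    """Transliterate a Unicode string; unmapped characters pass through unchanged."""
    return text.translate(_TABLE)


def bytes_to_buckwalter(raw, encoding="utf-8"):
    """Decode raw bytes (e.g. 'utf-8' or 'cp1256') and transliterate them."""
    return to_buckwalter(raw.decode(encoding))


# The word "kitAb" (book), written without vowels: kaf, teh, alef, beh.
word = "\u0643\u062A\u0627\u0628"
print(to_buckwalter(word))
print(bytes_to_buckwalter(word.encode("cp1256"), encoding="cp1256"))
```

Both calls print the same ASCII string, since the only difference between the
two inputs is the byte encoding that is decoded before the table is applied.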



We will use only the 100 most frequent verbs in this set to draw the training
and trial data (for the semantic role labeling tasks) and the test data (for
the semantic role labeling and WSD tasks).

The data is syntactically and morphologically manually annotated. The
syntactic trees are constituency trees.

A preliminary version of the Arabic WordNet will be available



Evaluation metric



SRL: Conlleval metrics of precision, recall and F-measure.

WSD: Scorer 2.0 metrics of precision, recall and F-measure on both coarse
and fine-grained sense distinctions.
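
As a reminder of how these three figures relate, the sketch below computes
precision, recall and the balanced F-measure from a set of gold annotations
and a set of system annotations. It is not the Conlleval or Scorer 2.0 code,
and those scorers may treat partial credit or unattempted items differently.
The representation of an annotation as a (start, end, label) tuple is an
assumption made for this example.

```python
# Minimal sketch of precision / recall / F-measure over exact-match annotations.

def precision_recall_f(gold, predicted):
    """Return (precision, recall, f1) for two sets of hashable annotations."""
    correct = len(gold & predicted)           # annotations matching exactly
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (start token, end token, role label).
gold = {(0, 1, "ARG0"), (2, 3, "ARG1"), (3, 5, "ARGM-TMP"), (5, 6, "ARG2")}
predicted = {(0, 1, "ARG0"), (2, 3, "ARG1"), (3, 5, "ARGM-LOC")}

p, r, f = precision_recall_f(gold, predicted)
```

Here two of the three predicted annotations match the gold set of four, so
precision is 2/3 and recall is 2/4; the span with the wrong label counts as an
error for both.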

****************************************************************************



Mona T. Diab, PhD

Center for Computational Learning Systems

Computational Linguistics Group

Columbia University



Tel.: +1 212 870 1290

Fax: +1 212 870 1285


