32.275, FYI: DETOXIS Task: DEtection of TOxicity in comments In Spanish (IberLEF-2021)

Thu Jan 21 07:11:34 UTC 2021

LINGUIST List: Vol-32-275. Thu Jan 21 2021. ISSN: 1069 - 4875.

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Jeremy Coburn
Managing Editor: Becca Morris
Team: Helen Aristar-Dry, Everett Green, Sarah Robinson, Lauren Perkins, Nils Hjortnaes, Yiwen Zhang, Joshua Sims
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================

Date: Thu, 21 Jan 2021 02:10:02
From: Mariona Taulé [mtaule at ub.edu]
Subject: DETOXIS Task: DEtection of TOxicity in comments In Spanish (IberLEF-2021)

The DETOXIS task will take place as part of IberLEF 2021, the 3rd Workshop of
the Iberian Languages Evaluation Forum, at the SEPLN 2021 Conference, which
will be held in September 2021 in Spain.

Webpage: https://detoxisiberlef.wixsite.com/website

The aim of the task is the detection of toxicity in comments posted in Spanish
in response to different online news articles related to immigration.
The DETOXIS task is divided into two related classification subtasks:
- Subtask 1: Toxicity detection is a binary classification task that
consists of classifying the content of a comment as toxic (toxic=yes) or not
toxic (toxic=no).
- Subtask 2: Toxicity level detection is a more fine-grained
classification task in which the aim is to identify the level of toxicity of a
comment (0 = not toxic; 1 = mildly toxic; 2 = toxic; 3 = very toxic).
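
The two label schemes above can be sketched in Python as follows. This is
only an illustration, not the organizers' format: the names ToxicityLevel and
to_binary are hypothetical, and the rule that a comment counts as toxic=yes
whenever its level is above 0 is an assumption drawn from the level
definitions, not something stated in this announcement.

```python
from enum import IntEnum


class ToxicityLevel(IntEnum):
    """Subtask 2 labels, as listed in the task description."""
    NOT_TOXIC = 0
    MILDLY_TOXIC = 1
    TOXIC = 2
    VERY_TOXIC = 3


def to_binary(level: int) -> str:
    """Map a Subtask 2 level to a Subtask 1 label.

    Assumption (not confirmed by the announcement): any level above 0
    corresponds to toxic=yes. Raises ValueError for levels outside 0-3.
    """
    return "yes" if ToxicityLevel(level) > ToxicityLevel.NOT_TOXIC else "no"
```

Under that assumption, a system that predicts levels for Subtask 2 could
derive a Subtask 1 run from the same predictions.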

Although we recommend participating in both subtasks, participants may
take part in just one of them (e.g., Subtask 1).
Teams will be allowed (and encouraged) to submit multiple runs (max. 5).

A comment is toxic when it attacks, threatens, insults, offends, denigrates or
disqualifies a person or group of people on the basis of characteristics such
as race, ethnicity, nationality, political ideology, religion, gender and
sexual orientation, among others. This attack can be expressed in different
ways, explicitly (through insult, mockery and inappropriate humor) or
implicitly (for instance through sarcasm), and at different levels of
intensity, that is, at different levels of toxicity (from impolite and
offensive comments to the most aggressive, the latter being those comments
that incite hate or even physical violence). We use toxicity as an umbrella
term under which we include different definitions used in the literature to
describe hate speech and abusive, aggressive, toxic or offensive language. In
fact, these different terms address different aspects of toxic language.
The detection of toxicity, and especially its classification into different
levels, is a difficult task because the identification of toxic comments
depends not only on the linguistic content itself (what is being said
and the way in which it is conveyed), but also on the contextual information
(i.e., the conversational thread) and the extralinguistic context, which is
related to real-world knowledge.
The presence of toxic messages on social media and the need to identify and
mitigate them have led to the development of systems for their automatic
detection. The automatic detection of toxic language, especially in tweets and
comments, is a task that has attracted growing interest from the NLP community
in recent years.
DETOXIS is the first task that focuses on the detection of different levels of
toxicity in comments posted in response to news articles written in Spanish.

The dataset will be the NewsCom-TOX corpus, which consists of comments
posted in response to different articles extracted from Spanish online
newspapers and discussion forums.
We will provide participants with 70% of the NewsCom-TOX corpus, including
all the annotated features, for training their models. The remaining 30%
of the corpus (unlabeled) will be used for testing their models.
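
To make the 70/30 arrangement concrete, here is a minimal Python sketch of
such a split. It is not the organizers' procedure, which this announcement
does not describe: the function name split_corpus, the field names "comment"
and "toxicity_level", the random shuffle and the seed are all assumptions
made for illustration.

```python
import random


def split_corpus(records, train_fraction=0.7, seed=0):
    """Split annotated records into a labeled training part and an
    unlabeled test part.

    Hypothetical record format: dicts with a "comment" field plus
    annotation fields such as "toxicity_level".
    """
    shuffled = list(records)               # copy, so the input is untouched
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    cut = round(len(shuffled) * train_fraction)
    train = shuffled[:cut]                 # keeps all annotated fields
    # Test records keep only the comment text, i.e. they are unlabeled.
    test = [{"comment": r["comment"]} for r in shuffled[cut:]]
    return train, test
```

Participants who want a local validation set before the test data is released
could apply the same kind of split to the training portion they receive.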

In order to avoid any conflict with the sources of comments regarding their
Intellectual Property Rights (IPR), the data will be sent privately to each
participant who is interested in the task. The corpus will be available
only for research purposes.

Important Dates:
- Training dataset release: March 1, 2021
- Test dataset release: April 22, 2021
- Systems results: May 10, 2021
- Results notification: May 17, 2021
- Working papers submission: June 2, 2021
- Working papers (peer-)reviewed: June 15, 2021
- Camera-ready versions: July 5, 2021

Linguistic Field(s): Computational Linguistics

Subject Language(s): Spanish (spa)

Language Family(ies): Spanish based
