31.2037, FYI: PARSEME shared task 1.2 on semi-supervised identification of verbal multiword expressions



LINGUIST List: Vol-31-2037. Sat Jun 20 2020. ISSN: 1069 - 4875.

Subject: 31.2037, FYI: PARSEME shared task 1.2 on semi-supervised identification of verbal multiword expressions

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Jeremy Coburn
Managing Editor: Becca Morris
Team: Helen Aristar-Dry, Everett Green, Sarah Robinson, Lauren Perkins, Nils Hjortnaes, Yiwen Zhang, Joshua Sims
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================


Date: Sat, 20 Jun 2020 03:29:59
From: Marie Candito [marie.candito at gmail.com]
Subject: PARSEME shared task 1.2 on semi-supervised identification of verbal multiword expressions

 
http://multiword.sourceforge.net/sharedtask2020

PARSEME shared task 1.2 - Final call for participation

The third edition of the PARSEME shared task addresses the automatic
identification of **verbal multiword expressions (VMWEs)** in running text,
with **emphasis on discovering VMWEs that were not seen in the training
corpus**.

See the shared task website for all additional information:
http://multiword.sourceforge.net/sharedtask2020

#### Blind test data and upload of system results

The PARSEME team has prepared corpora in which VMWEs were manually annotated:
https://gitlab.com/parseme/corpora/wikis/home. The provided annotations follow
the PARSEME 1.2 guidelines:
https://parsemefr.lis-lab.fr/parseme-st-guidelines/1.2/.

On March 23, 2020, we released, for each language: 

* a training corpus manually annotated for VMWEs;
* a development corpus to tune/optimize the systems' parameters; and
* a syntactically parsed raw corpus, not annotated for VMWEs, to support
semi- and unsupervised methods for VMWE discovery (depending on the language,
its size ranges from 12 million to 2.5 billion tokens); see the sketch after
this list for one possible use of the raw corpus.

On July 1, 2020, we will release, for each language:
* a blind test corpus to be used as input to the systems during the
evaluation phase, throughout which the VMWE annotations will be kept secret.

On July 3, 2020, participants will have to upload their annotated version of
the test corpus at 
https://www.softconf.com/coling2020/MWE-LEX/

Morphosyntactic annotations (parts of speech, lemmas, morphological features,
and syntactic dependencies) are also provided, both for annotated and raw
corpora.

The annotated training and development corpora are released in the CUPT format
(which is the CoNLL-U format with an extra column for the MWE annotations).
The raw corpora are released in the CoNLL-U format. The blind test corpus will
be released in the CUPT format, with an underspecified 11th column to be
predicted. Reference annotations for the test corpus will be released after the
evaluation phase.
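
As an informal illustration of the CUPT format mentioned above (not an
official parser), the sketch below reads the 11th, PARSEME:MWE column and
groups the tokens of each annotated VMWE. It assumes the usual CUPT
conventions, namely `*` for tokens outside any VMWE, `_` for the
underspecified column of the blind test, `N:CATEGORY` on the first token of
VMWE number N and `N` alone on its remaining tokens; the file name is a
placeholder.

```python
# Illustrative CUPT reader (assumed conventions, not an official tool):
# column 11 holds "*" (no VMWE), "_" (underspecified), "N:CATEGORY" on the
# first token of VMWE number N, and "N" on its other tokens, ";"-separated
# if a token belongs to several VMWEs.
def read_vmwes(path):
    """Yield, for each sentence, a dict {vmwe_id: (category, [token forms])}."""
    vmwes, cats, in_sent = {}, {}, False
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                           # blank line ends a sentence
                if in_sent:
                    yield {i: (cats.get(i, "?"), toks) for i, toks in vmwes.items()}
                vmwes, cats, in_sent = {}, {}, False
                continue
            if line.startswith("#"):               # skip comment/metadata lines
                continue
            in_sent = True
            cols = line.split("\t")
            form, mwe = cols[1], cols[10]          # FORM and PARSEME:MWE columns
            if mwe in ("*", "_"):                  # not part of a VMWE / underspecified
                continue
            for code in mwe.split(";"):
                idx, _, cat = code.partition(":")
                if cat:                            # category appears on the first token
                    cats[int(idx)] = cat
                vmwes.setdefault(int(idx), []).append(form)
    if in_sent:                                    # last sentence without trailing blank line
        yield {i: (cats.get(i, "?"), toks) for i, toks in vmwes.items()}


if __name__ == "__main__":
    # "train.cupt" stands for one of the released training files.
    for sent_vmwes in read_vmwes("train.cupt"):
        for idx, (cat, tokens) in sent_vmwes.items():
            print(idx, cat, " ".join(tokens))
```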

The trial data, training and dev sets are available on the shared task's
release repository: https://gitlab.com/parseme/sharedtask-data/tree/master/1.2

The raw corpus is available on the corpus initiative website:

https://gitlab.com/parseme/corpora/wikis/Raw-corpora-for-the-PARSEME-1.2-shared-task

Corpora are available for the following languages: German (DE), Greek (EL),
Basque (EU), French (FR), Irish (GA), Hebrew (HE), Hindi (HI), Italian (IT),
Polish (PL), Brazilian Portuguese (PT), Romanian (RO), Swedish (SV), Turkish
(TR), Chinese (ZH).

The amount of data in the training, development, test, and raw corpora
depends on the language.

#### Corpus split

For each language, the annotated sentences are shuffled and split in a way
that ensures a minimum of 300 VMWEs in the test set that are unseen in the
training + dev sets. This means that the natural sequence of sentences in a
document will not be respected in the proposed corpus split. Note that the
unseen ratio, that is, the proportion of unseen VMWEs with respect to all
VMWEs in the test set, may vary across languages. To guide participants on
this hard task, the number and rate of unseen VMWEs for the dev corpora are
available on the shared task website. In both tracks, the use of corpora from
previous shared task editions and from the PARSEME source repositories is
strictly forbidden, as material may have moved between corpus splits.
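
As a rough illustration of the unseen ratio described above, the sketch below
computes it under a working assumption (not necessarily the official
definition) that a test VMWE counts as seen if the multiset of its component
lemmas was annotated as a VMWE in the training + dev data; the lemma tuples
in the example are invented.

```python
# Working-assumption sketch of the unseen ratio: a test VMWE is "seen" iff the
# multiset of its component lemmas occurs among the train+dev VMWE annotations.
# This is an illustration, not the official evaluation definition or script.
def unseen_ratio(train_dev_vmwes, test_vmwes):
    """Return (number of unseen test VMWEs, their proportion among all test VMWEs)."""
    seen = {tuple(sorted(v)) for v in train_dev_vmwes}   # order-insensitive multiset key
    unseen = [v for v in test_vmwes if tuple(sorted(v)) not in seen]
    ratio = len(unseen) / len(test_vmwes) if test_vmwes else 0.0
    return len(unseen), ratio


# Invented lemma tuples, just to show the computation:
train_dev = [("take", "place"), ("make", "decision")]
test = [("take", "place"), ("pay", "attention"), ("pull", "leg")]
n_unseen, ratio = unseen_ratio(train_dev, test)
print(n_unseen, round(ratio, 2))   # 2 unseen VMWEs out of 3, ratio 0.67
```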

#### Important dates (updated)

* Jul 01, 2020: blind test corpus released
* Jul 03, 2020: submission of system results
* Jul 09, 2020: announcement of results
* Sep 02, 2020: shared task system description papers due (same as regular
papers)
* Oct 16, 2020: notification of acceptance
* Nov 01, 2020: camera-ready system description papers due
* Dec 13, 2020: shared task session at the MWE-LEX 2020 workshop at Coling
2020
 



Linguistic Field(s): Computational Linguistics





 



----------------------------------------------------------
LINGUIST List: Vol-31-2037	
----------------------------------------------------------





