[Lingtyp] Call for Participation: SIGTYP 2020 Shared Task on the prediction of typological features

Fri Apr 3 00:53:16 UTC 2020

https://sigtyp.github.io/st2020.html

The SIGTYP workshop, co-located with the EMNLP 2020 conference in Punta Cana (Dominican Republic), is offering a shared task on the prediction of typological features. The shared task encompasses nearly 2,000 languages, with typological features taken from the World Atlas of Language Structures (WALS; Dryer and Haspelmath 2013).

To participate in the shared task, you will build a system that can predict typological properties of languages, given a handful of observed features. Training examples and development examples have already been provided (see link below). All submitted systems will be compared on a held-out test set.

You will also be invited to describe your system in a system paper for the SIGTYP workshop proceedings. The task organisers will write an overview paper that describes the task and summarises the different approaches taken and their results.

Important Links

- Download Train and Dev data: https://github.com/sigtyp/ST2020/tree/master/data
- Register for the Task! https://sigtyp.github.io/st2020-reg.html

Important Dates

- Training data Release: 26 March 2020
- Test data Release: 20 June 2020
- Submissions Due: 1 July 2020
- Writeup Due: 1 August 2020

Description

The typological features in WALS represent one approach to categorizing the languages of the world according to their linguistic properties, e.g. their syntax, morphology, and phonology. One example of such a typological feature is basic word order. For instance, English is best described as a subject-verb-object (SVO) language, whereas Japanese is best described as a subject-object-verb (SOV) language.

One major issue with WALS, however, is that its language-feature annotations are both sparse and skewed. They are sparse in the sense that most languages only have annotations for a handful of features, and skewed in the sense that a few features have much wider coverage than others. Luckily, typological features often correlate with one another, which makes it possible to predict a missing feature from the ones that are observed. For instance, languages where the verb precedes the object tend to have prepositions, e.g. Norwegian, whereas languages where the object precedes the verb tend to have postpositions, e.g. Japanese.
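
As a rough illustration of how one feature can be predicted from another, the sketch below counts how often each adposition type co-occurs with each object-verb order and predicts the most frequent one. The rows are hand-entered toy examples and the field names (object_verb_order, adposition) are made up for this sketch; this is neither the shared-task data format nor a recommended system.

```python
from collections import Counter, defaultdict

# Toy, hand-entered rows for illustration only (NOT the shared-task data format).
# The field names "object_verb_order" and "adposition" are hypothetical.
TRAIN = [
    {"name": "English", "object_verb_order": "VO", "adposition": "prepositions"},
    {"name": "Norwegian", "object_verb_order": "VO", "adposition": "prepositions"},
    {"name": "French", "object_verb_order": "VO", "adposition": "prepositions"},
    {"name": "Japanese", "object_verb_order": "OV", "adposition": "postpositions"},
    {"name": "Turkish", "object_verb_order": "OV", "adposition": "postpositions"},
    {"name": "Hindi", "object_verb_order": "OV", "adposition": "postpositions"},
]


def fit_conditional_counts(rows, given, target):
    """Count how often each target value co-occurs with each given value."""
    counts = defaultdict(Counter)
    for row in rows:
        # Skip languages where either feature is unannotated (sparsity).
        if given in row and target in row:
            counts[row[given]][row[target]] += 1
    return counts


def predict(counts, given_value, fallback=None):
    """Return the most frequent target value seen with given_value."""
    if given_value in counts:
        return counts[given_value].most_common(1)[0][0]
    return fallback


counts = fit_conditional_counts(TRAIN, "object_verb_order", "adposition")
print(predict(counts, "OV"))
```

A real system would combine many observed features and would need a sensible fallback for feature values that never occur in training.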

Although there is a significant amount of previous work dealing with versions of this task (Daumé III and Campbell 2007; Bjerva et al. 2019; Ponti et al. 2019), important design choices have frequently been ignored. Some papers controlled for genetic relationships between training and evaluation languages, but little to no work has considered controlling for geographical proximity.
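
To make the notion of controlling for genetic relationships concrete, here is a minimal sketch of a split in which no evaluation language shares a family with a training language. The language list is hand-entered for illustration and the function name split_by_family is hypothetical; this is not the procedure used to build the shared-task splits.

```python
# Hand-entered (language, family) pairs for illustration only.
LANGUAGES = [
    ("English", "Indo-European"),
    ("Norwegian", "Indo-European"),
    ("Hindi", "Indo-European"),
    ("Japanese", "Japonic"),
    ("Turkish", "Turkic"),
    ("Finnish", "Uralic"),
]


def split_by_family(languages, held_out_families):
    """Put every language of a held-out family into the evaluation set."""
    train, evaluation = [], []
    for name, family in languages:
        if family in held_out_families:
            evaluation.append(name)
        else:
            train.append(name)
    return train, evaluation


train, evaluation = split_by_family(LANGUAGES, {"Japonic", "Uralic"})
print(train)
print(evaluation)
```

Controlling for geographical proximity could follow the same pattern, holding out whole areas instead of whole families.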

The shared task will consist of two settings (subtasks):

  1.  Constrained: only provided training data can be employed.
  2.  Unconstrained: training data can be extended with any external source of information (e.g. pre-trained embeddings or raw text).

Organizers

Johannes Bjerva
Isabelle Augenstein
Aditi Chaudhary
Edoardo M. Ponti
Giuseppe Celano
Liz Salesky
Ryan Cotterell
Michael Regan
Sabrina J. Mielke

Contact

- email: sigtyp AT gmail DOT com
- website: https://sigtyp.github.io/st2020.html
