

LINGUIST List: Vol-34-458. Fri Feb 03 2023. ISSN: 1069-4875.

Subject: 34.458, Calls: babyLM Challenge - shared task hosted in CoNLL/CMCL 2023

Moderator: Malgorzata E. Cavar, Francis Tyers (linguist at linguistlist.org)
Managing Editor: Lauren Perkins
Team: Helen Aristar-Dry, Steven Franks, Everett Green, Sarah Robinson, Joshua Sims, Jeremy Coburn, Daniel Swanson, Matthew Fort, Maria Lucero Guillen Puon, Billy Dickson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================


Date: 
From: Leshem Choshen [leshem.choshen at mail.huji.ac.il]
Subject: babyLM Challenge - shared task hosted in CoNLL/CMCL 2023


Full Title: babyLM Challenge - shared task hosted in CoNLL/CMCL 2023
Short Title: babyLM

Date: 01-Oct-2023 - 06-Oct-2023
Location: CoNLL conference or CMCL workshop, Singapore
Contact Person: Leshem Choshen
Meeting Email: leshem.choshen at mail.huji.ac.il
Web Site: https://babylm.github.io/

Linguistic Field(s): Applied Linguistics; Cognitive Science;
Computational Linguistics; Language Acquisition; Text/Corpus
Linguistics

Call Deadline: 15-Jul-2023

Meeting Description:

Announcing the BabyLM Challenge, the shared task at CoNLL/CMCL 2023!


The goal of this shared task is to encourage researchers with an
interest in pretraining and/or cognitive modeling to focus their
efforts on optimizing pretraining given data limitations inspired by
human development. Additionally, we hope to democratize research on
pretraining—which is typically thought to be practical only for large
industry groups—by formulating an exciting open problem and
establishing a community around it.


A huge effort has been put into optimizing LM pretraining at massive
scales in the last several years. While increasingly larger models
often get the most attention, datasets have also grown by orders of
magnitude. For example, Chinchilla is exposed to 1.4 trillion words
during training, well over 10,000 words for every word a 13-year-old
human has encountered in their entire life.
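
As a rough sanity check on that ratio, here is a back-of-the-envelope
sketch in Python; the 100-million-word figure for a 13-year-old's
cumulative linguistic input is an assumption for illustration, not a
number stated above:

    # Back-of-the-envelope check of the scale gap described above.
    # The 100M-word figure for a child's cumulative input by age 13 is
    # an assumption; the 1.4T figure is quoted in the text above.
    chinchilla_words = 1.4e12    # words seen by Chinchilla during pretraining
    child_words_by_13 = 1e8      # assumed cumulative input by age 13

    ratio = chinchilla_words / child_words_by_13
    print(f"Chinchilla sees roughly {ratio:,.0f} words per child word")  # ~14,000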


Focusing on scaled-down pretraining has several potential benefits:
First, small-scale pretraining can be a sandbox for developing novel
techniques for improving data efficiency. These techniques could then
be scaled up to the data sizes commonly seen in applied NLP, or used
to enhance current approaches to modeling low-resource languages.
Second, improving our ability to train LMs on the same kinds and
quantities of data that humans learn from will, we hope, give us
greater access to plausible cognitive models of humans and help us
understand what allows humans to acquire language so efficiently.

Call for Papers:

The task has three tracks, two of which restrict the training data to
pre-released datasets of 10M and 100M words and are dedicated to
explorations of approaches such as architectural variations,
self-supervised objectives, and/or curriculum learning. The final
track restricts only the amount of text used, allowing innovation in
the choice of the data, its domain, and even its modality (i.e., data
from sources other than text is welcome). We will release a shared
evaluation pipeline that scores models on a variety of benchmarks and
tasks, including targeted syntactic evaluations and natural language
understanding.
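
For concreteness, here is a minimal sketch of how a submission's
training corpus might be checked against one of these word budgets;
the directory layout, track labels, and whitespace-based counting rule
are illustrative assumptions rather than the official procedure:

    # Minimal sketch: count whitespace-separated tokens in a corpus
    # directory and compare the total against a track's word budget.
    # Paths, track labels, and the counting rule are assumptions.
    from pathlib import Path

    WORD_BUDGETS = {"10M-track": 10_000_000, "100M-track": 100_000_000}

    def count_words(corpus_dir: str) -> int:
        """Sum whitespace-separated tokens over all .txt files in a directory."""
        total = 0
        for path in Path(corpus_dir).glob("*.txt"):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    total += len(line.split())
        return total

    n = count_words("babylm_data/train_10M")  # hypothetical corpus location
    budget = WORD_BUDGETS["10M-track"]
    print(f"{n:,} of {budget:,} words used "
          f"({'within' if n <= budget else 'over'} budget)")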


Important dates:

January 2023: Training data released (see website for download)

March 2023: Evaluation pipeline released

July 15, 2023: Results due

August 1, 2023: Paper submissions due

Date TBA: Presentation at CoNLL

This is a somewhat unusual call for papers: we hope to reach
linguists and to find computational models that are more relevant to
explaining learning, or that are inspired by linguistic knowledge.

For more information, visit the BabyLM website
https://babylm.github.io/ or consult our extended call for papers.



------------------------------------------------------------------------------


LINGUIST List is supported by the following publishers:

Bloomsbury Publishing (formerly The Continuum International Publishing Group) http://www.bloomsbury.com/uk/

Brill http://www.brill.com

Cascadilla Press http://www.cascadilla.com/

Georgetown University Press http://www.press.georgetown.edu

John Benjamins http://www.benjamins.com/

Lincom GmbH https://lincom-shop.eu/

Multilingual Matters http://www.multilingual-matters.com/

Springer Nature http://www.springer.com


----------------------------------------------------------
LINGUIST List: Vol-34-458
----------------------------------------------------------

