37.1702, Confs: 13th Web-as-Corpus Workshop @ EMNLP 2026 (Hungary)

The LINGUIST List linguist at listserv.linguistlist.org
Thu May 7 16:05:02 UTC 2026


LINGUIST List: Vol-37-1702. Thu May 07 2026. ISSN: 1069 - 4875.

Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Valeriia Vyshnevetska
Team: Helen Aristar-Dry, Mara Baccaro, Daniel Swanson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Editor for this issue: Valeriia Vyshnevetska <valeriia at linguistlist.org>

================================================================


Date: 06-May-2026
From: Veronika Laippala [veronika.laippala at utu.fi]
Subject: 13th Web-as-Corpus Workshop @ EMNLP 2026


13th Web-as-Corpus Workshop @ EMNLP 2026
Short Title: WaC-13

Date: 24-Oct-2026 - 29-Oct-2026
Location: Budapest, Hungary
Meeting URL: https://wacky-workshop.github.io/

Linguistic Field(s): Applied Linguistics; Computational Linguistics;
Text/Corpus Linguistics

Submission Deadline: 07-Aug-2026

The World Wide Web has evolved from a resource for building linguistic
corpora into the central data infrastructure powering modern natural
language processing and Large Language Models (LLMs). As web-scale
data increasingly shapes AI systems’ knowledge and capabilities,
understanding its quality, representativeness, and ethical
implications has become critical.

At the same time, the “more is better” paradigm is being challenged by
issues such as machine-generated content, data toxicity, limited
metadata, and the under-representation of many languages and domains.
These challenges call for a shift toward Data-Centric AI, focusing on
the curation, analysis, and responsible use of web-derived data.

The 13th Web-as-Corpus (WaC-13) workshop provides a multidisciplinary
forum for research addressing the full lifecycle of web data. We
invite submissions on methods, resources, and applications related to
web corpora, with special emphasis on multilingual data and
less-resourced languages.

Topics of interest include (but are not limited to):
 - Creation and evaluation of high-quality datasets for foundation
   models (e.g., data collection, filtering, enrichment, language
   identification)
 - Use of web data in empirical linguistic research
 - Analysis of web-scale corpora for quality, representativeness, and
   societal insights
 - Ethical and legal aspects of collecting, sharing, and using web
   data

By bringing together researchers from NLP, linguistics, and the social
sciences, WaC aims to advance best practices for one of the field’s
most influential data sources.

Important Dates:
Direct paper submission deadline: 7 August, 2026
Pre-reviewed ARR commitment deadline: 1 September, 2026
Notification of acceptance: 5 September, 2026
Camera-ready paper due: 20 September, 2026
Conference dates: 24-29 Oct, 2026

Submissions:
Submissions will be possible through ARR commitment and through
openreview.net (more details to follow on
https://wacky-workshop.github.io/).

Workshop Organizers:
Nikola Ljubešić, Jožef Stefan Institute, Slovenia
Yves Scherrer, University of Oslo, Norway
Laurie Burchell, Common Crawl
Veronika Laippala, TurkuNLP, University of Turku, Finland
Pedro Ortiz Suarez, Common Crawl
Jen English, Common Crawl
Vuk Dinić, Jožef Stefan Institute, Slovenia


