36.3894, Confs: 19th Workshop on Building and Using Comparable Corpora at LREC 2026 (Spain)

Thu Dec 18 15:05:02 UTC 2025

LINGUIST List: Vol-36-3894. Thu Dec 18 2025. ISSN: 1069 - 4875.

Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Valeriia Vyshnevetska
Team: Helen Aristar-Dry, Mara Baccaro, Daniel Swanson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Editor for this issue: Valeriia Vyshnevetska <valeriia at linguistlist.org>

================================================================

Date: 17-Dec-2025
From: Reinhard Rapp [reinhardrapp at gmx.de]
Subject: 19th Workshop on Building and Using Comparable Corpora at LREC 2026

19th Workshop on Building and Using Comparable Corpora at LREC 2026
Short Title: BUCC 2026

Date: 11-May-2026
Location: Palma de Mallorca, Spain
Meeting URL: https://comparable.lisn.upsaclay.fr/bucc2026/

Linguistic Field(s): Computational Linguistics; Text/Corpus Linguistics; Translation

Submission Deadline: 28-Feb-2026

In the language engineering and linguistics communities, research on
comparable corpora has been motivated by two main concerns. In
language engineering, it is chiefly motivated by the need to use
comparable corpora as training data for data-driven NLP applications
such as statistical and neural machine translation or cross-lingual
retrieval. In linguistics, comparable corpora are of interest because
they enable cross-language discoveries and comparisons. It is
generally accepted in both communities that comparable corpora consist
of documents that are comparable in content and form to various
degrees and along various dimensions across several languages.
Parallel corpora are at one end of this spectrum, and unrelated
corpora are at the other. Increasingly, these resources are not only
collected but also augmented or even created synthetically, which
raises new questions about how to define and measure comparability.

In recent years, the use of comparable corpora for pre-training Large
Language Models (LLMs) has led to their impressive multilingual and
cross-lingual abilities, which are relevant to a range of
applications, including information retrieval, machine translation,
and cross-lingual text classification. The linguistic definitions and
observations related to comparable corpora are crucial for improving
methods to mine such corpora, for assessing and documenting synthetic
data, and for improving the cross-lingual transfer of LLMs. Therefore,
it is of great interest to bring together builders and users of such
corpora.

Panel Discussion:
The panel will discuss the impact of synthetic data on comparable
corpora research and address fundamental questions about how LLMs
transform our understanding and use of multilingual data.

Topics:
We solicit contributions on all topics related to comparable (and
parallel) corpora, including but not limited to the following:

Building Comparable Corpora:
 - Automatic and semi-automatic methods, including generating
  comparable corpora using LLMs
 - Methods to mine parallel and non-parallel corpora from the web
 - Tools and criteria to evaluate the comparability of corpora
 - Parallel vs non-parallel corpora, monolingual corpora
 - Rare and minority languages, within and across language families
 - Multi-media/multi-modal comparable corpora

Synthetic Data for Comparable Corpora:
 - LLM generation of comparable/parallel data
 - Improving comparability of synthetic data
 - Incidental bilingualism & pre-training use of comparable data
 - Comparability & cross-lingual consistency
 - Detection & attribution of synthetic vs. human text
 - English-centric effects & fairness across languages/scripts
 - Evaluation & reproducibility for downstream tasks

Applications of Comparable Corpora:
 - Human translation
 - Language learning
 - Cross-language information retrieval & document categorization
 - Bilingual and multilingual projections
 - (Unsupervised) machine translation
 - Writing assistance
 - Machine learning techniques using comparable corpora

Mining from Comparable Corpora:
 - Cross-language distributional semantics, word embeddings and
  pre-trained multilingual transformer models
 - Extraction of parallel segments or paraphrases from comparable
  corpora
 - Methods to derive parallel from non-parallel corpora (e.g. to
  provide for low-resource languages in neural machine translation)
 - Extraction of bilingual and multilingual translations of single
  words, multi-word expressions, proper names, named entities,
  sentences, paraphrases, etc. from comparable corpora
 - Induction of morphological, grammatical, and translation rules from
  comparable corpora
 - Induction of multilingual word classes from comparable corpora

Comparable Corpora in the Humanities:
 - Comparing linguistic phenomena across languages in contrastive
  linguistics
 - Analyzing properties of translated language in translation studies
 - Studying language change over time in diachronic linguistics
 - Assigning texts to authors via authors' corpora in forensic
  linguistics
 - Comparing rhetorical features in discourse analysis
 - Studying cultural differences in sociolinguistics
 - Analyzing language universals in typological research

Important Dates:
28 Feb 2026: Paper submission deadline
22 Mar 2026: Notification of acceptance
29 Mar 2026: Camera-ready final papers
14 Apr 2026: Workshop Programme final version
11 May 2026: Workshop date
All deadlines are 11:59PM UTC-12:00 (“anywhere on earth”).
For updates to the schedule, please see the workshop website.

Practical Information:
The workshop is a hybrid event, held both in person and online.
Workshop registration is via the main conference registration site;
see https://lrec2026.info/
The workshop proceedings will be published in the ACL Anthology
(https://aclanthology.org/).

Submission Guidelines:
Please follow the style sheet and templates (for LaTeX, Overleaf and
MS-Word) provided for the main conference at
https://lrec2026.info/authors-kit/
Papers should be submitted as a PDF file using the START conference
manager at https://softconf.com/lrec2026/BUCC2026/
Submissions must describe original and unpublished work and range from
4 to 8 pages plus unlimited references. Reviewing will be
double-blind, so papers should not reveal the authors' identity.
Accepted papers will be published in the workshop proceedings.
Double submission policy: Parallel submission to other meetings or
publications is possible, but the workshop organizers must be notified
by e-mail immediately upon submission to another venue.
For further information and updates, please see the BUCC 2026 web page
at https://comparable.lisn.upsaclay.fr/bucc2026/.

Workshop Organizers:
Reinhard Rapp (University of Mainz, Germany)
Ayla Rigouts Terryn (Université de Montréal, Mila, Canada)
Serge Sharoff (University of Leeds, United Kingdom)
Pierre Zweigenbaum (Université Paris-Saclay, CNRS, France)
Contact: reinhardrapp (at) gmx (dot) de

Programme Committee:
Ebrahim Ansari (Institute for Advanced Studies in Basic Sciences,
Iran)
Eleftherios Avramidis (DFKI, Germany)
Gabriel Bernier-Colborne (National Research Council, Canada)
Kenneth Church (VecML.com, USA)
Patrick Drouin (Université de Montréal, Canada)
Alex Fraser (Technical University of Munich, Germany)
Natalia Grabar (CNRS, University of Lille, France)
Amal Haddad Haddad (Universidad de Granada, Spain)
Kyo Kageura (University of Tokyo, Japan)
Natalie Kübler (Université Paris Cité, France)
Philippe Langlais (Université de Montréal, Canada)
Yves Lepage (Waseda University, Japan)
Shervin Malmasi (Amazon, USA)
Michael Mohler (Language Computer Corporation, USA)
Emmanuel Morin (Nantes Université, France)
Dragos Stefan Munteanu (RWS, USA)
Preslav Nakov (Mohamed bin Zayed University of AI, United Arab
Emirates)
Ted Pedersen (University of Minnesota, Duluth, USA)
Reinhard Rapp (University of Mainz, Germany)
Ayla Rigouts Terryn (Université de Montréal & Mila, Canada)
Nasredine Semmar (CEA LIST, Paris, France)
Serge Sharoff (University of Leeds, UK)
Richard Sproat (Sakana.ai, Tokyo, Japan)
Marko Tadić (University of Zagreb, Croatia)
François Yvon (CNRS & Sorbonne Université, France)
Pierre Zweigenbaum (Université Paris-Saclay, CNRS, France)

Information About the LRE 2026 Map and the "SHARE YOUR LRs!"
Initiative:
When submitting a paper from the START page, authors will be asked to
provide essential information about resources (in a broad sense, i.e.
also technologies, standards, evaluation kits, etc.) that have been
used for the work described in the paper or are a new result of the
research.
Moreover, ELRA encourages all LREC authors to share the described LRs
(data, tools, services, etc.) to enable their reuse and the
replicability of experiments (including evaluation experiments).
