LINGUIST List: Vol-36-857. Wed Mar 12 2025. ISSN: 1069-4875.
Subject: 36.857, Confs: Developing models for linguistic research: Training data usage in low-resource scenarios (Workshop as part of the 2nd "Language & Languages at the crossroads of Disciplines" conference) (France)
Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Justin Fuller
Team: Helen Aristar-Dry, Steven Franks, Joel Jenkins, Daniel Swanson, Erin Steitz
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org
Homepage: http://linguistlist.org
Editor for this issue: Justin Fuller <justin at linguistlist.org>
================================================================
Date: 12-Mar-2025
From: Natasha Romanova [natalia.romanova at unicaen.fr]
Subject: Developing models for linguistic research: Training data usage in low-resource scenarios (Workshop as part of the 2nd "Language & Languages at the crossroads of Disciplines" conference)
Developing models for linguistic research: Training data usage in
low-resource scenarios (Workshop as part of the 2nd "Language &
Languages at the crossroads of Disciplines" conference)
Date: 01-Sep-2025 - 03-Sep-2025
Location: University of Lille, ESJ Lille, 50 rue Gauthier-de-Châtillon,
59046 Lille, France
Contact: Natasha Romanova
Contact Email: natalia.romanova at unicaen.fr
Meeting URL: https://llcd2025.sciencesconf.org/resource/page/id/8
Linguistic Field(s): Applied Linguistics; Computational Linguistics;
General Linguistics; Historical Linguistics; Syntax
The advent of deep learning technologies, and in particular Large
Language Models (LLMs) and their multilingual variants, promised to
facilitate a wide variety of labour-intensive expert tasks required to
conduct corpus-based research in linguistics, including but not
limited to transcription of handwritten (Kiessling et al. 2019) and
oral (Michaud et al. 2020) sources, lemmatization (Camps et al. 2021),
syntactic tagging and parsing (Guiller 2020), named entity recognition
(Ortiz Suárez et al. 2020), segmentation of speech (Algayres et al.
2023) and text (Levenson et al. 2024). This can be especially
relevant for low-resource scenarios with limited amounts of available
training data, for example when working with under-resourced (e.g.
minority and endangered) languages or ancient languages, where
few digital corpora exist.
These languages may not have a standardized spelling, which can affect
transcription and annotation tasks (transcribing oral corpora
requires decisions about transcription guidelines; spelling variation
in written corpora can complicate annotation, etc.). Further examples
of low-resource scenarios are learner corpora and language-disorder
corpora. In these cases, however, a single round of
fine-tuning of an LLM can often result in output quality that is well
below the state of the art for well-resourced tasks and linguistic
varieties. As shown by Kantharuban et al. (2023) on dialectal data,
there is no one-size-fits-all solution for this issue: the optimal
approach depends on language, task, model, and data type, and can be
affected both by data size and data makeup. Moreover, the computing
power needed to train and fine-tune models requires considerable
investment and has a non-negligible environmental impact (Bender et
al. 2021).
This workshop explores research aimed at optimizing model adaptation
processes (using LLMs or alternatives), in particular by addressing
the distance between training and target corpora in the contexts of
limited linguistic, human and energy resources. These approaches range
from computational methods dedicated to training optimization (e.g.
curriculum learning, Bengio et al. 2009) to linguistically motivated
protocols for training data selection (e.g. Guibon et al. 2015). These
methods take into account not only the amount but also the makeup and
quality of the training data, which can be mobilized progressively
during training, allowing the model to adapt gradually to the target
language or language variety, genre, or type of data. The behaviour of
the trained systems during the adaptation process can, in turn, inform
linguists about the distance between the training corpus and the
target corpus.
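As a toy illustration of this progressive-mobilization idea, the
Python sketch below scores each training sentence by how lexically
close it is to a target corpus and releases the data cumulatively in
stages, closest first. It is a minimal sketch under assumed
simplifications (whitespace tokenization, a rarity-based difficulty
score), not a method prescribed by the workshop.

import math
from collections import Counter

def target_rarity(sentence, target_freqs, total_tokens):
    # Mean negative log-probability of the sentence's words under an
    # add-one-smoothed unigram model of the target corpus; lower values
    # mean the sentence is lexically closer to the target.
    words = sentence.split()
    if not words:
        return float("inf")
    vocab_size = len(target_freqs)
    return sum(-math.log((target_freqs[w] + 1) / (total_tokens + vocab_size))
               for w in words) / len(words)

def curriculum_stages(train_sentences, target_sentences, n_stages=3):
    # Rank training sentences from closest-to-target to most distant,
    # then release them cumulatively: stage k yields the k easiest chunks.
    target_freqs = Counter(w for s in target_sentences for w in s.split())
    total_tokens = sum(target_freqs.values())
    ranked = sorted(train_sentences,
                    key=lambda s: target_rarity(s, target_freqs, total_tokens))
    step = math.ceil(len(ranked) / n_stages)
    for stage in range(1, n_stages + 1):
        yield ranked[:stage * step]

# Hypothetical usage: fine_tune() stands in for whatever training step
# a given toolchain provides.
# for stage_data in curriculum_stages(train, target, n_stages=4):
#     fine_tune(model, stage_data)

In practice the difficulty score could equally encode annotation
quality, genre, or any other linguistically motivated criterion.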
In this workshop, we would like to address similarities and
differences in the approaches to the design of model adaptation
processes in different areas of linguistic application (transcription,
annotation, etc.) with three research questions in mind:
1) How can measuring the distance between the training corpus and the
target corpus (linguistic and “extra-linguistic” distance, e.g.
principles of annotation and segmentation) before training guide the
construction of training and fine-tuning datasets? (An illustrative
sketch follows this list.)
2) How can the evaluation of learning curves and training performance,
in turn, help measure the distance between training corpora and target
corpora?
3) What is the impact of the interoperability of transcription and/or
annotation standards in the training sets on the performance of the
resulting systems?
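As one concrete, hedged reading of question 1, the sketch below
quantifies the lexical distance between a training corpus and a target
corpus as the Jensen-Shannon divergence between their unigram
frequency distributions; the choice of measure is an assumption made
for illustration, not part of the call.

import math
from collections import Counter

def js_divergence(corpus_a, corpus_b):
    # Jensen-Shannon divergence between the unigram (word-frequency)
    # distributions of two corpora: 0 means identical distributions,
    # 1 bit means fully disjoint vocabularies.
    freq_a, freq_b = Counter(corpus_a.split()), Counter(corpus_b.split())
    total_a, total_b = sum(freq_a.values()), sum(freq_b.values())
    vocab = set(freq_a) | set(freq_b)
    p = {w: freq_a[w] / total_a for w in vocab}
    q = {w: freq_b[w] / total_b for w in vocab}
    m = {w: (p[w] + q[w]) / 2 for w in vocab}
    def kl(x, y):
        return sum(x[w] * math.log2(x[w] / y[w]) for w in vocab if x[w] > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Fully disjoint vocabularies give the maximum value of 1 bit:
print(js_divergence("the cat sat on the mat",
                    "le chat est sur le tapis"))  # -> 1.0

A larger divergence might argue for more in-domain fine-tuning data or
a staged curriculum, and tracking the same measure against learning
curves connects question 1 to question 2.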
The aim of the workshop is to foster dialogue and collaboration
between researchers working on different aspects of model adaptation
in linguistics in order to begin to:
i) develop best practices for the design of
training/fine-tuning/adaptation processes in low-resource scenarios
across disciplines and approaches;
ii) point out problems with existing tools and formulate desiderata
for future tools (both in terms of performance and usability);
iii) establish guidelines for the evaluation of tools and processes
(metrics, corpus size and quality);
iv) identify strategies for the use of linguistic knowledge in the
design of the training/adaptation processes, on the one hand, and for
gaining new linguistic insights through the implementation and testing
of tools and approaches, on the other.
We invite contributions from scholars working towards the
constitution, transcription, annotation and analysis of linguistic
corpora who use, adapt and question the performance of existing
computational tools, including machine learning tools, from
computational linguists interested in the adaptation of models and
from Natural Language Processing (NLP) specialists engaged in dialogue
with linguistics researchers. We particularly welcome presentations on
case studies of the application of model adaptation methodologies in
linguistics.
Call for papers URL:
https://llcd2025.sciencesconf.org/data/pages/Developing_models_for_linguistic_research_training_data_usage_in_low_resource_scenarios_Workshop.pdf
The languages of the workshop will be English and French. Please note
that
abstracts for individual contributions must not contain any
information that could identify the authors (name, affiliation,
address, bibliographic references). Submissions must be a maximum of
500 words (including examples, excluding references) and should
clearly present the research questions, adopted approach, methodology,
data, and results. The abstract can be written in English or French
and must be formatted according to the provided template. To submit
your proposal, please follow the EasyChair link provided at
https://llcd2025.sciencesconf.org/resource/page/id/8
References
Algayres, R., Diego-Simon, P., Sagot, B., & Dupoux, E. (2023). XLS-R
fine-tuning on noisy word boundaries for unsupervised speech
segmentation into words. In Findings of the Association for
Computational Linguistics: EMNLP 2023 (pp. 12103–12112). Singapore:
Association for Computational Linguistics. URL:
https://aclanthology.org/2023.findings-emnlp.810/
Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021).
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
In Proceedings of the Conference on Fairness, Accountability, and
Transparency (FAccT '21), March 3–10, 2021, Virtual Event, Canada
(14 pp.). New York, NY: ACM. URL:
https://doi.org/10.1145/3442188.3445922
Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009).
Curriculum learning. In Proceedings of the 26th Annual International
Conference on Machine Learning (pp. 41–48).
Camps, J.-B., Gabay, S., Fièvre, P., Clérice, T., & Cafiero, F.
(2021). Corpus and Models for Lemmatisation and POS-tagging of
Classical French Theatre. Journal of Data Mining and Digital
Humanities (22 pp.). URL: https://shs.hal.science/halshs-02591388v2
Guibon, G., Tellier, I., Prévost, S., Constant, M., & Gerdes, K.
(2015). Analyse syntaxique de l’ancien français : quelles propriétés
de la langue influent le plus sur la qualité de l’apprentissage ? In
Actes de la 22e conférence Traitement Automatique des Langues
Naturelles (TALN 2015) (13 pp.). URL: https://hal.science/hal-01251006
Guiller, K. (2020). Analyse syntaxique automatique du pidgin-créole du
Nigeria à l’aide d’un transformer (BERT) : Méthodes et Résultats.
Master’s thesis, Sorbonne Nouvelle.
Kantharuban, A., Vulić, I., & Korhonen, A. (2023). Quantifying the
Dialect Gap and its Correlates Across Languages. In Findings of the
Association for Computational Linguistics: EMNLP 2023 (pp. 7226–7245).
Singapore: Association for Computational Linguistics. URL:
https://aclanthology.org/2023.findings-emnlp.481/
Kiessling, B., Tissot, R., Stokes, P., & Stökl Ben Ezra, D. (2019).
eScriptorium: An Open Source Platform for Historical Document
Analysis. In 2019 International Conference on Document Analysis and
Recognition Workshops (ICDARW), Sydney, Australia (pp. 19). URL:
https://inria.hal.science/hal-04030514/document
Levenson, M. G., Ing, L., & Camps, J.-B. (2024). Textual Transmission
without Borders: Multiple Multilingual Alignment and Stemmatology of
the “Lancelot en prose” (Medieval French, Castilian, Italian).
Computational Humanities Research (28 pp.).
Michaud, A., Adams, O., Cox, C., Guillaume, S., Wisniewski, G., et al.
(2020). La transcription du linguiste au miroir de l’intelligence
artificielle : réflexions à partir de la transcription phonémique
automatique. Bulletin de la Société de Linguistique de Paris, 115(1),
141–166. URL: https://shs.hal.science/halshs-02881731
Miletić, A. (2018). Un treebank pour le serbe : constitution et
exploitations [PhD thesis, Université Toulouse le Mirail - Toulouse
II]. URL: https://theses.hal.science/tel-02639473
Ortiz Suárez, P. J., Dupont, Y., Muller, B., Romary, L., & Sagot, B.
(2020). Establishing a New State-of-the-Art for French Named Entity
Recognition. In Proceedings of the Twelfth Language Resources and
Evaluation Conference (pp. 4631–4638). Marseille, France: European
Language Resources Association. URL:
https://aclanthology.org/2020.lrec-1.569/
Peng, Z., Gerdes, K., & Guiller, K. (2022). Pull your treebank up by
its own bootstraps. In L. Becerra, B. Favre, C. Gardent, & Y.
Parmentier (Eds.), Journées Jointes des Groupements de Recherche
Linguistique Informatique, Formelle et de Terrain (LIFT) et Traitement
Automatique des Langues (TAL) (pp. 139–153). URL:
https://hal.science/hal-03846834
------------------------------------------------------------------------------
********************** LINGUIST List Support ***********************
Please consider donating to the Linguist List to support the student editors:
https://www.paypal.com/donate/?hosted_button_id=87C2AXTVC4PP8
LINGUIST List is supported by the following publishers:
Bloomsbury Publishing http://www.bloomsbury.com/uk/
Cambridge University Press http://www.cambridge.org/linguistics
Cascadilla Press http://www.cascadilla.com/
De Gruyter Mouton https://cloud.newsletter.degruyter.com/mouton
Elsevier Ltd http://www.elsevier.com/linguistics
John Benjamins http://www.benjamins.com/
Language Science Press http://langsci-press.org
Lincom GmbH https://lincom-shop.eu/
Multilingual Matters http://www.multilingual-matters.com/
Netherlands Graduate School of Linguistics / Landelijke (LOT) http://www.lotpublications.nl/
Oxford University Press http://www.oup.com/us
Wiley http://www.wiley.com
----------------------------------------------------------
LINGUIST List: Vol-36-857
----------------------------------------------------------