36.1463, Confs: Talking Data (Italy)
The LINGUIST List
linguist at listserv.linguistlist.org
Thu May 8 17:05:02 UTC 2025
LINGUIST List: Vol-36-1463. Thu May 08 2025. ISSN: 1069 - 4875.
Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Justin Fuller
Team: Helen Aristar-Dry, Steven Franks, Joel Jenkins, Daniel Swanson, Erin Steitz
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org
Homepage: http://linguistlist.org
Editor for this issue: Erin Steitz <ensteitz at linguistlist.org>
================================================================
Date: 06-May-2025
From: Caterina Mauri [caterina.mauri at unibo.it]
Subject: Talking Data
Talking Data
Theme: Methodological and theoretical challenges raised by spoken
interaction data
Date: 09-Oct-2025 - 10-Oct-2025
Location: Bologna, Italy
Contact: Caterina Mauri
Contact Email: caterina.mauri at unibo.it
Meeting URL:
https://site.unibo.it/divers-ita/en/outreach-and-events/talking-data
Linguistic Field(s): Computational Linguistics; Language
Documentation; Pragmatics; Sociolinguistics; Text/Corpus Linguistics
Submission Deadline: 20-May-2025
Organizing Committee:
Caterina Mauri, Eleonora Zucchini, Silvia Ballarè, Ludovica Pannitto.
Scientific Committee:
All the members of the PRIN 2022 PNRR DiverSIta – Diversity in Spoken
Italian:
Cecilia Andorno, Silvia Ballarè, Beatrice Bernasconi, Claudia
Borghetti, Massimo Cerruti, Paolo Antonio Della Putta, Eugenio Goria,
Nicola Grandi, Guglielmo Inglese, Yahis Martari, Caterina Mauri,
Ludovica Pannitto, Rosa Pugliese, Eleonora Zucchini.
Confirmed invited speakers:
Robbie Love (Aston University)
Lorenza Mondada (University of Basel)
Marlou Rasenberg (Radboud University)
Stefan Schnell (University of Zurich)
The conference aims to gather scholars working on data of spoken
interaction from a variety of perspectives, with different approaches
and goals, across different linguistics fields. We are especially
interested in contributions addressing how this type of data raises
both methodological and theoretical challenges all along the way, from
collection, through transcription, to annotation and analysis.
The conference is the closing event of the project DiverSIta,
Diversity in Spoken Italian, which is dedicated mainly to the
expansion of KIParla (Mauri et al. 2019, www.kiparla.it), a corpus
designed to document spoken Italian over time, in its internal
diversity of speakers and communicative situations, with a focus on
naturally occurring data (Ballarè, Mauri & Goria 2022). The conference
will be an opportunity to describe the corpus and the whole KIParla
enterprise and to learn about further resources, in different
languages, that share the focus on spoken interaction data.
Participants will have the chance to discuss the theoretical and
methodological challenges that this type of data raises in various
fields and approaches to the study of language, and to find common or
complementary objectives to pursue.
Collecting, transcribing, and publishing data of spoken interaction
notoriously pose more challenges than building resources of written
data or of spoken but monological data. For many years, therefore,
spoken corpora were limited to so-called WEIRD and LOL languages, the
latter being languages with standardized written forms (Literate),
official recognition (Official), and large speaker populations (Lots
of users) (Dahl 2015). Only recently have resources containing spoken
data become available for a wider variety of languages, including less
described ones, although a significant portion of such data consists
of monological narratives (cf. Multi-CAST, Haig & Schnell 2015; SCOPIC,
Barth & Evans 2021; Dingemanse & Liesenfeld 2022; DoReCo, Seifart,
Paschen & Stave 2024).
Access to spoken data is crucial for various linguistic analytical
perspectives that focus on language variation in a broad sense.
Observing spoken interaction, despite its inherent messiness and
unpredictability, is essential for developing comprehensive and
accurate descriptions of language as it is truly used in real-life
contexts. This approach helps mitigate biases toward overly polished
or artificially structured data, allowing for a more authentic
representation of linguistic diversity.
We welcome contributions discussing the issues, solutions, and
challenges in building, annotating, using and comparing corpora of
spoken interaction data, also in a cross-disciplinary perspective,
highlighting the role of this specific type of data in shaping
linguistic analyses, linguistic models, and methodological choices. A
non-exhaustive list of topics includes:
Methodologies: corpus design, data collection, transcription,
annotation and publication
- Sampling and balancing: reconciling representativeness with the
nature of spoken data
- Ecological and ethical practices for data collection
- Challenges and possible solutions for manual or (semi-)automated
transcription
- Data formats and standards
- Data annotation: units of transcription, units of analysis,
disfluencies, co-constructions, multilingual interactions, …
- Data FAIRness and accessibility: privacy protection and data sharing
- Main problems and solutions for multilingual corpora annotation
- Treebanks of spoken interactional data: how to deal with overlaps
or utterance co-construction, ...
- LLM training based on conversational data and LLM interactional
performance evaluation
…
Analysis: Spoken interaction data in different approaches
- Language variation and spoken data: how interaction shapes internal
variation
- Sociolinguistic perspectives on spoken data: to what extent can
social categories explain variation in spoken language?
- Typological approaches to spoken interaction data, e.g. universal
vs. language-specific phenomena, available resources
- Computational approaches to spoken interaction data, e.g. LLM
training and fine-tuning, automatic detection of interactional
phenomena
- Diachronic approaches to spoken interaction data, e.g. emergent
constructions, studies highlighting the role of dialogical interaction
in language change
- Studies on interactional data involving L2 speakers or speakers with
multilingual repertoires: e.g. what can we learn about language
acquisition and learners’ varieties from this type of data; how the
presence of L2 speakers or speakers with complex repertoires shapes
language in interaction.
- Psycholinguistic approaches, e.g. experimental settings involving
spoken interactions
…
Submission information
Abstract submission: please send a one-page abstract (references
excluded) in PDF to caterina.mauri at unibo.it and
eleonora.zucchini2 at unibo.it
Deadline for abstract submission: 20th May
Notification of acceptance: 31st May
References
Barth, Danielle & Nicholas Evans (eds). 2017-2021. Social Cognition
Parallax Interview Corpus (SCOPIC). “Language Documentation &
Conservation Special Publication” 12. Honolulu, University of Hawai'i
Press.
Dingemanse, Mark & Andreas Liesenfeld. 2022. From text to talk:
Harnessing conversational corpora for humane and diversity-aware
language technology. In Proceedings of the 60th Annual Meeting of the
Association for Computational Linguistics. Dublin, Association for
Computational Linguistics, pp. 5614–5633.
Dobrovoljc, Kaja. 2022. Spoken Language Treebanks in Universal
Dependencies: an Overview. In Proceedings of the Thirteenth Language
Resources and Evaluation Conference. Marseille, European Language
Resources Association, pp. 1798–1806.
Haig, Geoffrey & Stefan Schnell (eds.). 2015. Multi-CAST: Multilingual
corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/).
Mauri, Caterina, Silvia Ballarè, Eugenio Goria, Massimo Cerruti &
Francesco Suriano. 2019. KIParla corpus: A new resource for spoken
Italian. In CEUR Workshop Proceedings, CEUR-WS 2481, pp. 1–7.
Mauri, Caterina, Silvia Ballarè, Eugenio Goria & Massimo Cerruti.
2022. Il corpus KIParla. In Corpora e studi linguistici. Milano,
Officinaventuno, pp. 109–118.
Seifart, Frank, Ludger Paschen & Matthew Stave (eds.). 2024. Language
Documentation Reference Corpus (DoReCo) 2.0. Lyon, Laboratoire
Dynamique Du Langage (UMR5596, CNRS & Université Lyon 2).
Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li,
Lu Lu, Zejun Ma & Chao Zhang. 2024. Connecting speech encoder and
large language model for ASR. In Proceedings of IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.
12637-12641.