35.3402, Review: Cognitive Science, Historical Linguistics, Semantics, Sociolinguistics; Lexical Variation and Change: Schäfer (2024)
The LINGUIST List
linguist at listserv.linguistlist.org
Tue Dec 3 07:05:05 UTC 2024
LINGUIST List: Vol-35-3402. Tue Dec 03 2024. ISSN: 1069 - 4875.
Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Justin Fuller
Team: Helen Aristar-Dry, Steven Franks, Joel Jenkins, Daniel Swanson, Erin Steitz
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org
Homepage: http://linguistlist.org
Editor for this issue: Joel Jenkins <joel at linguistlist.org>
================================================================
Date: 03-Dec-2024
From: Martin Schäfer [post at martinschaefer.info]
Subject: Cognitive Science, Historical Linguistics, Semantics, Sociolinguistics; Lexical Variation and Change: Schäfer (2024)
Book announced at https://linguistlist.org/issues/35.577
AUTHOR: Dirk Geeraerts
AUTHOR: Dirk Speelman
AUTHOR: Kris Heylen
AUTHOR: Mariana Montes
AUTHOR: Stefano De Pascale
AUTHOR: Karlien Franco
AUTHOR: Michael Lang
TITLE: Lexical Variation and Change
SUBTITLE: A Distributional Semantic Approach
PUBLISHER: Oxford University Press
YEAR: 2024
REVIEWER: Martin Schäfer
SUMMARY
The book reviewed here presents a framework for assessing lexical
variation and explores the ways in which distributional semantics can
be used for investigations within this framework. It is based on work
in the Quantitative Lexicology and Variational Linguistics research
group at the University of Leuven, with all authors being or having
been part of this research group. Aiming to reach a diverse audience,
it contains theoretical and methodological introductions, as well as
case studies focusing on specific aspects of the framework (some of
the case studies based on published works, others new). It is
organized into five parts with two chapters each, with each chapter
ending with a "bottom line" section summarizing the main points.
Part I, Theoretical Preliminaries, introduces the descriptive
framework in Chapter 1, followed by an introduction to distributional
semantics in Chapter 2. In Chapter 1, the authors distinguish
different lexicological perspectives, with the semasiological
perspective focusing on the semantics of how a form is used, and the
onomasiological perspective focusing on how meaning can be expressed
by various lexical items. In addition, they introduce the term lect as
a cover term for all kinds of language varieties, with the lectometric
perspective then interested in measuring the distance between lects.
These concepts are then discussed against the background of reference-
and usage-oriented linguistics: What happens if notions like salience
and prototypicality are considered? How does this link to vector space
semantics? The first quantitative measures are introduced:
onomasiological profiles and uniformity indices. An onomasiological
profile captures the distribution of synonyms used for the same
concept in a specific variety, while the uniformity index for one
concept expresses the usage overlap of the specific terms used for
this concept across two lects. This uniformity measure can be weighted by
relative frequency and also used for sets of concepts. The final two
sections of the chapter are mostly literature pointers covering
specific aspects of language variation research and the place of the
research program within cognitive linguistics.
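To make the two measures concrete, here is a minimal sketch in Python.
It assumes that "usage overlap" is computed as the sum, over all terms,
of the smaller of the two relative frequencies; whether this matches
the book's exact definition is an assumption on my part. The function
names and the toy counts are invented for illustration.

```python
from collections import Counter

def profile(tokens):
    """Relative frequency of each term used for one concept in one lect."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def uniformity(profile_a, profile_b):
    """Overlap of two profiles: per term, take the smaller relative frequency."""
    terms = set(profile_a) | set(profile_b)
    return sum(min(profile_a.get(t, 0.0), profile_b.get(t, 0.0)) for t in terms)

def weighted_uniformity(per_concept, concept_freq):
    """Mean of per-concept uniformity values, weighted by concept frequency."""
    total = sum(concept_freq[c] for c in per_concept)
    return sum(u * concept_freq[c] / total for c, u in per_concept.items())

# Invented toy data: two competing terms for one concept in two lects.
lect_a = ["term_1"] * 8 + ["term_2"] * 2
lect_b = ["term_1"] * 5 + ["term_2"] * 5
u = uniformity(profile(lect_a), profile(lect_b))

# Invented toy data: two concepts with different overall frequencies.
u_set = weighted_uniformity({"A": 0.7, "B": 1.0}, {"A": 3, "B": 1})
```

Two lects that use exactly the same terms in the same proportions score
1, and two lects that share no term for the concept score 0.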
The second chapter explains the conceptual basis of the distributional
semantics approach used in the book. Importantly, it is count-based
(as opposed to most current vector space models which are based on
machine learning), and the interest is in token-vectors, that is,
vectors that model individual instances of a word in a corpus (as
opposed to type-vectors, which stand for all instances of a word in a
corpus). What this means is explained in detail, with the following
sections showing the links between distributional semantics and other
strains of research in lexical semantics, starting off with a careful
discussion of how one could establish the different senses of a word.
Part II is entirely dedicated to distributional methodology, with its
two chapters focusing on token-based distributional semantics proper
and visual analytics for the analysis of the results. In contrast to
Chapter 2, Chapter 3 goes into the actual details of building a
distributional semantics model. Starting with example sentences
containing the word to be modeled, the authors carefully walk through
the many parameters that can be used in model creation. Once there is
a token-based model, its items, that is, the vectors representing
instances of the actual words in context, can be compared by creating
a matrix of their cosine similarities, and using this matrix as input
for visualization and clustering. The chapter ends with an overview of
all the different parameter settings that are used in the different
studies presented later in the book.
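The step from token vectors to a similarity matrix can be sketched as
follows. This is a deliberate simplification and not the book's
pipeline: each token is represented only by raw counts of the words in
a fixed window around it, and the window size, function names, and
example sentences are all invented for illustration.

```python
import math
from collections import Counter

def token_vector(words, position, window=2):
    """Count the context words within `window` positions of one token."""
    lo = max(0, position - window)
    hi = min(len(words), position + window + 1)
    return Counter(
        w for i, w in enumerate(words[lo:hi], start=lo) if i != position
    )

def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_matrix(vectors):
    """Pairwise cosine similarities between all token vectors."""
    return [[cosine(a, b) for b in vectors] for a in vectors]

# Invented toy sentences, each containing one instance of "bank".
sentences = [
    "she sat on the bank of the river".split(),
    "he fished from the bank of the river".split(),
    "the bank raised its interest rate".split(),
]
vectors = [token_vector(s, s.index("bank")) for s in sentences]
matrix = similarity_matrix(vectors)
```

In this toy case the two riverside tokens come out more similar to each
other than either is to the financial one. A matrix of this kind
(typically converted to distances) is what would be passed on to
clustering and visualization.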
Chapter 4 explains how the distance matrices can be visualized in
2-dimensional space, and which methods can be used to select
representative models for specific lemmata given the many models
typically produced. The main part of the chapter introduces the
NephoVis visualization tool, software designed to explore the
different distributional models created, which can be interactively
compared. In addition, a further software tool for model exploration,
realized as a ShinyApp, is introduced. Both tools come with websites
demonstrating their use.
Part III, Semasiological and Onomasiological Explorations, starts with
a chapter titled "Making sense of distributional semantics". Here, the
methods explained before are demonstrated through analyses of 32 Dutch
nouns, adjectives, and verbs, all represented by around 300 corpus
attestations which have been manually annotated for lexicographic
sense. For each lemma, many different distributional models are built
(differing in their parameter settings). Investigating the
relationship between parameter settings and models for specific
lemmata, it is shown that there is no specific setting that works in
the same way for all lemmata. In terms of the fit of the resulting
models to the different lexicographic senses annotated in the data, it
turns out that clouds, that is, clusters in a specific model, do not
typically directly map to specific senses. Instead, there exist
complicated patterns across the lemmata and models between clusters
and word senses. A careful discussion of four different representative
configurations of clusters in relationship to the lexicographic
senses, internally arranged along collocational dimensions,
constitutes the core of the chapter.
Chapter 6 adds an onomasiological component in investigating two Dutch
near-synonyms, the verbs "vernielen" and "vernietigen" 'to destroy'.
It is shown how the methods introduced so far can be used to analyze
the interchangeability of near-synonyms relative to specific contexts,
and how the incorporation of further metadata can enrich the models.
The focus then turns firmly to the two verbs, which were interchangeable
in 19th-century Dutch. First, a distributional analysis of the pair in
contemporary Dutch is provided, followed by an analysis of the pair
over time, analyzing data from four earlier time periods (16/17th,
18th, 19th, and 20th century). The results, which are again carefully
discussed, reveal an increase in differences between the two verbs
over time.
Part IV, Lectometric Methodology, turns to the comparison of lects.
Chapter 7 introduces the tools to quantify lectal structure and
change. First, it explains the uniformity measures for concepts and
sets thereof in more detail. Second, in order to investigate the
development of the two Dutch standard languages, Belgian and
Netherlandic Dutch, the concepts of hierarchical destandardization,
informalization, and dehomogenization are explained and
operationalized in terms of relations between the different uniformity
measures across lects. Their application is illustrated by a study on
the development of clothing terms in Belgian and Netherlandic Dutch.
To complement the uniformity measure, a measure for lexical success of
words with specific features is introduced, capturing the proportion
of words with specific characteristics in expressions for concepts
from a specific domain (motivated by the success of English loanwords
in the field of information technology).
Chapter 8 goes through the steps involved in exploiting distributional
semantics for lectometry. With the now expected attention to detail, a
bottom-up approach for the selection of near-synonyms is presented
(relying on type-level vectors), followed by a discussion of which
context words to use in the creation of the actual token-based models.
The next section justifies and explains measures taken to keep only
those tokens in the models that actually represent one of the senses
to be investigated. These include automatic steps, e.g. discarding
tokens that are not assigned to a cluster by a clustering algorithm,
but also semi-automatic steps, e.g. annotating part of the data and
keeping only clusters in which one of the intended senses reaches a
pre-defined threshold in the annotated tokens of a cluster. This
leaves one with a set of models that can be further pruned, for
example by restricting the analysis to just those concepts that are
modeled by at least 50% of the models. Importantly, lectometric measures
can then be compared within specific models or across models. The
authors point out that these differences can be thought of as bringing
forward different semantic aspects of the concepts under discussion.
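The two pruning steps described above can be illustrated with a small
Python sketch. The threshold of 0.8, the data layout, and the function
names are assumptions made for this illustration; the review does not
state the values or data structures the authors actually use.

```python
def keep_clusters(clusters, min_share=0.8):
    """Keep clusters whose annotated tokens mostly carry an intended sense.

    `clusters` maps a cluster id to a list of booleans, one per annotated
    token in that cluster (True = intended sense, False = other sense).
    Clusters without any annotated tokens are dropped.
    """
    kept = []
    for cluster_id, annotations in clusters.items():
        if annotations and sum(annotations) / len(annotations) >= min_share:
            kept.append(cluster_id)
    return kept

def keep_concepts(modeled_by, n_models, min_fraction=0.5):
    """Keep concepts that survive in at least `min_fraction` of the models."""
    return [c for c, n in modeled_by.items() if n / n_models >= min_fraction]

# Invented toy annotations for the clusters of one model.
clusters = {
    "c1": [True, True, True, True, False],  # 80% intended sense
    "c2": [True, False, False],             # 33% intended sense
    "c3": [],                               # no annotated tokens
}
kept_clusters = keep_clusters(clusters)

# Invented toy counts: in how many of 36 models each concept survives.
kept_concepts = keep_concepts({"A": 20, "B": 17, "C": 36}, n_models=36)
```

Here only cluster "c1" and concepts "A" and "C" remain for the
lectometric calculations.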
Part V, Lectometric Explorations, contains two studies that use the
approach put forward in Part IV. Chapter 9 looks at the dynamics of
standardization of Belgian Dutch and Netherlandic Dutch, in part an
extended replication of the study presented in Chapter 7 but this
time with the new distributional semantics tools. Eight lectal strata
are considered, with formal (quality newspapers) and informal (Usenet
groups/tweets) Dutch data from either Belgium or The Netherlands,
either from around 2000, or from 2017-18. After mostly bottom-up
concept selection, 108 token-based vector space models are built for
a total of 85 concepts from different word classes. After further
fine-tuning, all remaining models were used in the analysis, which
looks at the behavior of the uniformity indices for all concepts
across the models, at the behavior of each concept and finally also in
more detail at the semantic fields in the noun subset of the data.
While inspection of the different models shows that the measurement
does not depend on parameter choices, looking at the concepts
individually reveals slightly more nuanced differences. Most variation
emerges in the analysis of the noun subset by semantic field. In
comparison to the previous results, the same general trends emerge,
albeit mostly only as tendencies.
Chapter 10, Pluricentricity from a quantitative point of view,
investigates data from six Spanish national varieties. After an
introduction on Spanish as an international language, the corpus and
concept selection are described, again mostly done bottom-up, but this
time deliberately excluding food and clothing schemes because their
variation was deemed too complex to start with. For each of the 146
concepts, 36 models are created, and the model fine-tuning included a
massive annotation exercise, with 60 000 disambiguated tokens used to
select appropriate clusters in the models. Concepts that were
completely uniform were excluded from the analysis. The uniformity
values are analyzed in three ways: across all concepts that had at
least half of the possible models, the complement set of those, and
finally by only looking at the annotated data (without distributional
modeling). Multidimensional scaling is used for visualization,
followed by an analysis relative to lexical fields. A stand-out result
is the distance of Argentina and Spain from the other lects.
Methodologically, the authors demonstrate the use of multidimensional
scaling for visualizing relations between the lects, and convincingly
argue that even though the results from just the annotated tokens and
the concepts with many models are by and large similar, the detailed
model investigations are nevertheless a very valuable addition.
The book ends with a short conclusion, highlighting that what has been
presented is more of a research program than a finished product, and
pointing to further ways in which this program could and should be
developed: extending what has been done in this volume in probing the
effect of parameter settings/workflow variants more systematically,
considering the differences between the count-based models used here
and deep-learning models, and finally, bringing in comparisons to
referential
and psycholinguistic perspectives on meaning.
In its final section, the software resources used and discussed in the
book are collected.
EVALUATION
I profited very much from reading this book, with the detailed
discussions of the nitty-gritty of the distributional semantics
set-ups and workflows being a highlight. Even so, evaluating this book
in terms of its single-volume readability is a different matter.
According to the authors, the book's unique features are that it
offers a comprehensive overview of lexical variation and a critical
insight into the machinery of distributional modeling, and is
accompanied by the software resources used in conducting the analyses
discussed. From this, the authors suggest that the book will be of
interest to semanticists and lexicologists, computational linguists as
well as sociolinguists, and claim that the text is written with
minimal background assumptions. I am not sure that this is really
true; I suspect that some non-trivial experience with all of these
fields is probably necessary to make this a good reading experience.
I also felt that the structure of the book made reading it less
compelling than it could have been. For example, the first part
contains much detail on lexical variation, but at one point it was not
clear to me anymore why this is immediately relevant to motivate
and/or situate a distributional semantic approach. I was certainly
very happy when the first concrete applications of the distributional
semantics approach were presented in Chapter 5, while the preceding
chapter felt too much like a software walkthrough that perhaps might
be more fruitfully integrated into the software resources section.
In contrast, a very clear and much more important strength of the book
is its careful attention to detail in all sections: the authors
explain complicated things, and they explain them well. I also enjoyed
extended pointers to alternative pathways that could have been chosen
although they eventually were not chosen.
All in all, I do believe that the authors managed to make a compelling
case for their approach, that is, token-based modeling, production of
multiple models, and overall a plea not to look for, or even
necessarily believe in, there being one best model, but rather to
exploit the different models in order to better understand the nature
of meaning.
Some more form/presentation-related quibbles: the graphs in the book
are produced in gray-scale but discussed in color. While the graphs
are freely available in color (the whole book is open-access and the
pdf with color figures can be downloaded from the publisher’s
website), the gray-scale printing is very detrimental to reading the
book on paper. And given that it costs 83 GBP, this restriction is hard
to understand. In
general, the editorial quality is good, but sometimes figures are not
fully clear (for example, the parameter abbreviations in figures 5.1
and 5.2), or the reading flow would have benefited from additional
examples or pointers (for example, the reference on p. 148 to figure
5.10, a model of Dutch "grijs" 'grey' on page 131).
ABOUT THE REVIEWER
Martin Schäfer studied general linguistics, ancient Greek and English
at the Universität Leipzig. He has worked at many universities in
Germany, with a year as Marie Skłodowska-Curie Fellow at Anglia Ruskin
University, Cambridge thrown in for variety. As of now, he is a
lecturer in English linguistics at the Universität Leipzig. He is
mostly interested in semantics and related areas, and has used
distributional semantics in several of his own works.