37.1390, Reviews: Data-Intensive Investigations of English: Mikko Laitinen; Paula Rautionaho (eds.) (2025)

The LINGUIST List linguist at listserv.linguistlist.org
Fri Apr 10 22:05:02 UTC 2026


LINGUIST List: Vol-37-1390. Fri Apr 10 2026. ISSN: 1069 - 4875.

Subject: 37.1390, Reviews: Data-Intensive Investigations of English: Mikko Laitinen; Paula Rautionaho (eds.) (2025)

Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Valeriia Vyshnevetska
Team: Helen Aristar-Dry, Mara Baccaro, Daniel Swanson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Editor for this issue: Helen Aristar-Dry <hdry at linguistlist.org>

================================================================


Date: 10-Apr-2026
From: Ayano Shigeta-Watanabe [shigetaayano at people.kobe-u.ac.jp]
Subject: Mikko Laitinen; Paula Rautionaho (eds.) (2025)


Book announced at https://linguistlist.org/issues/36-3910

Title: Data-Intensive Investigations of English
Series Title: Studies in English Language
Publication Year: 2025

Publisher: Cambridge University Press
           http://www.cambridge.org/linguistics
Book URL:
https://www.cambridge.org/ch/universitypress/subjects/languages-linguistics/history-english-language/data-intensive-investigations-english?format=HB&isbn=9781009415712

Editor(s): Mikko Laitinen; Paula Rautionaho

Reviewer: Ayano Shigeta-Watanabe

SUMMARY
Data-intensive investigations of English (edited by Mikko Laitinen and
Paula Rautionaho) consists of ten chapters, all of which address
state-of-the-art data-intensive approaches in linguistic research. It
includes an introductory chapter by the editors that presents the
volume, eight chapters by different authors that report on
data-intensive studies across various linguistic disciplines, and a
final chapter that discusses issues in data analysis related to such
approaches. Among the eight studies, some examine linguistic variation
within areas such as dialectology, sociolinguistics, historical
linguistics, and second language acquisition, while others focus on
structural properties such as collocation and word formation. The
analyses are wide-ranging, employing a variety of analytical tools and
methods, and examining variation across multiple linguistic levels.
In Chapter 1 (“Data-intensive approaches to English Linguistics”),
Mikko Laitinen, Paula Rautionaho, and Irene Taipale outline the aims
of the volume and provide summaries of each chapter. The volume seeks
to introduce state-of-the-art data-intensive research in English
linguistics across a wide range of fields. According to the authors,
data-intensive research involves theoretically grounded and
methodologically rigorous analyses of linguistic questions using
large-scale digitized datasets and advanced statistical methods. This
approach represents a development from earlier small-scale,
qualitatively oriented studies, as well as from analyses employing
more basic statistical techniques. By drawing on knowledge and methods
from the digital humanities and computer science, such approaches
enable forms of research that were previously difficult or impossible
to conduct. At the same time, placing data and data analysis at the
center of research presents several challenges, including issues
related to data validation (e.g., representativeness), replicability,
and the appropriateness of analytical methods. The volume therefore
combines studies that showcase recent data-intensive approaches to
linguistics (Chapters 2–9) with a discussion of the broader
methodological and theoretical issues underlying these approaches
(Chapter 10).
In Chapter 2 (“What big data tell us about American English
phonetics”), William A. Kretzschmar Jr., Michael Olsen, and Rachel
Ireland analyze over 500,000 vowel tokens from the Digital Archive of
Southern Speech (DASS) using point pattern analysis to investigate
vowel distributions in Southern American English. By mapping vowel
realizations onto a fine-grained grid and incorporating distributional
density, the study identifies where vowels most frequently occur. The
main goal of this study is to challenge traditional views of the
Southern Vowel Shift (SVS), that is, the idea that multiple vowel sets
undergo a systematic chain shift, sometimes leading to overlap in the
vowel space. These accounts have typically relied on measures such as
the mean (the average value of all data points) and the mode (the most
frequently occurring value). However, the results based on point
pattern analysis show that only certain vowel sets, such as DRESS and
FACE, exhibit convergence in commonly produced speech. Overall, the
chapter demonstrates that analyses relying solely on mean and mode are
insufficient, and that incorporating distributional density provides a
more accurate account of vowel variation and sound change.
In Chapter 3 (“Do you reckon? Creating and testing a corpus of spoken
Southern American English from the Digital Archive of Southern Speech
(1970–1983)”), Keiko Bridwell and Katherine Ireland draw on data
extracted from the DASS (the same dataset used in Chapter 2),
enriching the dataset by incorporating detailed syntactic and speaker
demographic information through the Corpus Workbench. They address two
main questions. They first examine the grammaticalization of
“reckon”—compared to similar expressions such as “think” and
“guess”—through its syntactic environments and then consider its
social meaning by analyzing who produces it. They employ
distributional statistics (i.e., rate) to explore both linguistic and
social patterns, complemented by inferential statistics (i.e., linear
regression and pairwise comparison). The results show that “reckon”
most frequently occurs with first-person subjects and rarely appears
with that-clauses. This suggests that it is commonly used as a fixed
expression (“I reckon”) and has partially lost its original lexical
meaning, indicating ongoing grammaticalization. In this respect, it
patterns similarly to “guess”. Furthermore, its distribution among
older speakers, African American speakers, and those from lower
socioeconomic and educational backgrounds suggests that the form is
socially stigmatized.
In Chapter 4 (“‘Scots for the masses’? Utilising a novel data-analysis
facility to statistically explore Late Modern Scots in the digitised
chapbooks collection”), Sarah van Eyndhoven, Lisa Gotthard, and Rosa
Filgueira examine the spelling and vocabulary of chapbooks (i.e.
inexpensive books produced for a mass readership) published in
eighteenth- and nineteenth-century Scotland, using materials from the
National Library of Scotland. The authors process these digital
historical texts with defoe, a tool designed for text mining, and
investigate the extent to which Scots-derived forms appear in
comparison with standard English forms across topics, time periods,
location (where the chapbook was printed), and lexical items. The
analysis employs normalized frequencies and conditional inference
trees to identify both the frequency of Scots forms and the factors
influencing their use. The results show a high frequency of Scots
forms, particularly after 1811, in chapbooks from Aberdeen, Ayrshire,
and Edinburgh. Scots forms are more frequent in popular literature,
local news, and songs, whereas English dominates in genres such as
domestic advice and biography. A closer analysis further reveals that
while some lexical items remain relatively stable due to the
Conserving Effect (i.e. lexical entrenchment), others (e.g. “frae” (=
from) and “mair” (= more)) increase in the later period, contributing
to an overall rise in Scots usage and reflecting the Vernacular
Revival, a movement characterized by a renewed interest in and use of
local vernacular forms.
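As a rough illustration of what a normalized frequency is, the raw
count of a form is scaled to a common base (here 10,000 words) so that
subcorpora of different sizes can be compared. The sketch below is
only a minimal example: the function name is hypothetical and the
counts are invented for illustration, not taken from the chapter.

```python
def normalized_frequency(count, corpus_size, per=10_000):
    """Return the number of occurrences per `per` words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * per / corpus_size


# Hypothetical counts of Scots forms in two periods (invented numbers).
early = normalized_frequency(count=120, corpus_size=400_000)
late = normalized_frequency(count=450, corpus_size=600_000)

print(f"early period: {early:.1f} per 10,000 words")
print(f"late period:  {late:.1f} per 10,000 words")
```

Because both values share the same base, they can be compared
directly even though the two subcorpora differ in size.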
In Chapter 5 (“Combining collocation measures and distributional
semantics to detect idioms”), Gerold Schneider proposes a new method
for identifying noncompositional expressions (i.e., expressions whose
meanings cannot be predicted from their components), such as idioms. A
key contribution of this research is the combination of traditional
collocation measures with distributional semantics. Instead of relying
only on how frequently words co-occur, the method considers semantic
similarity, assuming that words in noncompositional expressions tend
to be semantically dissimilar, as well as lexical fixedness (called
“syndom”), assuming that noncompositional expressions are less
replaceable by semantically similar alternatives. This allows the
model to distinguish more effectively between compositional and
noncompositional expressions. The analysis is based on corpora such as
the British National Corpus and web-based data and examines
constructions involving verbs, prepositions, and nouns. It also
investigates diachronic changes in compound nouns by comparing data
from the 1990s and the 2010s in written British English. The results
show that overall, this combined approach improves the detection of
noncompositional expressions and that compound nouns related to IT and
the internet have increased over time.
In Chapter 6 (“Using data-intensive methods for unlocking expressions
in word formation”), Sabine Arndt-Lappe, Natalia Beliaeva, and Audrey
Martin investigate the expressiveness (i.e., the function of
expressing sentiment toward a referent) of expressions formed from two
human proper names. Two studies are conducted. In the first study,
expressions such as “Borisconi”—where one proper name (“Boris”) is
used to characterize another (“Berlusconi”)—are compared with
non-blended expressions to determine which type exhibits greater
expressiveness. The data are drawn from the News on the Web (NOW)
Corpus and sentiment values are estimated based on the surrounding
linguistic context using tools such as SentiWords (lexicon for
sentiment analysis) and RANGE (software). The results show that
blended forms tend to carry a negative expressive load. In the second
study, data are extracted using the Sketch Engine, and expressions
such as Kimye, where two proper names combine to denote a couple (Kim
and Kanye), are examined in terms of sentiment values across different
registers (press vs. social media). The results indicate that such
expressions tend to carry a positive expressive load, and that this
tendency does not significantly vary across registers.
In Chapter 7 (“Modals of future time reference across native and
non-native Englishes: a variationist analysis”), Paula Rautionaho and
Lea Meriläinen apply a data-intensive approach to research on second
language acquisition. The study focuses on the variation between “be
going to” and “will” and examines the frequency of these forms as well
as the factors that influence this variable choice. In particular, the
authors investigate whether the linguistic factors affecting this
choice in native speakers’ English also operate in non-native
varieties of English (Czech, Finnish, and Taiwanese English). The
study uses three relatively large corpora of native and non-native
speech (the Spoken British National Corpus, the Corpus of Contemporary
American English, and the Louvain International Database of Spoken English
Interlanguage) and applies a generalized mixed-model tree to examine
the strength of individual factors and the interactions among them. In
addition to the variation between “be going to” and “will”, the study
also analyzes variation between full and contracted forms within each
variant (“be going to” vs. “gonna” and “will” vs. “’ll” and “won’t”).
The results show that the two European non-native varieties of English
display patterns similar to those of native English—for example, a
preference for “be going to”—whereas Taiwanese English shows different
tendencies, including a high frequency of “will”. The former varieties
show more fine-grained similarities to native varieties; for example,
they use the full form “be going to” in subordinate clauses.
In Chapter 8 (“Bayesian multivariate analysis of complement selection:
subject-control complements of the verb fear”), Juho Ruohonen and
Juhani Rudanko investigate the conditions under which the verb “fear”
takes a to-infinitive (e.g., “she often feared to go to the bathroom
at night”) versus a gerund (e.g., “she feared walking a few blocks”).
In selecting the predictor factors affecting this variation, the
chapter draws on linguistic principles such as the Complexity
Principle, which predicts that more structurally complex contexts
favor explicit forms (the to-infinitive), and the Choice Principle,
which predicts that contexts involving
agentive, voluntary actions are more likely to select the
to-infinitive. The study employs Bayesian multivariate analysis on
American English data from the NOW Corpus (the same corpus used in
Chapter 6), allowing for the simultaneous evaluation of multiple
interacting factors even when the data are skewed or sparse. The
results show that to-infinitives are particularly favored in
extraction contexts (i.e., constructions where an element has been
moved from its usual position, such as in relative clauses), in
situations where a variant follows “fearing”, and when the understood
subject of the complement clause functions as an Agent.
In Chapter 9 (“Statistical modelling of syntactic complexity of
English academic texts using ensemble machine learning: syntactic
predictors of rhetorical sections”), Maryam Nasseri examines the
syntactic complexity of different rhetorical sections (Abstract,
Introduction, Literature Review, Methods & Methodology, Results &
Discussion, and Conclusion) in English MA dissertations written by
students with different backgrounds, and investigates which syntactic
measures used in previous studies best predict each section. A stacked
ensemble model combining multiple machine learning algorithms is
employed to improve the overall classification accuracy. The results
show that complex nominals per clause and complex T-unit ratio (a
measure of how often sentences include dependent clauses) are among
the strongest predictors across the corpus, with particularly high
frequencies in the Literature Review and Methods sections, indicating
a greater use of nominal and subordinate structures.
In Chapter 10 (“Implications of the replication crisis: some
suggestions to improve reproducibility and transparency in
data-intensive corpus linguistics”), Martin Schweinberger summarizes
key considerations for conducting linguistic research using
data-intensive methods. According to the author, recent corpus-based
analyses employing complex statistical techniques are associated with
the so-called Replication Crisis, which involves issues of
methodological transparency and reproducibility. To ensure that
research can be properly verified and evaluated, platforms for storing
and sharing linguistic data and analytical methods have recently
emerged, and their use is strongly recommended. The author also
suggests the use of open-source programming languages such as R and
Python to facilitate reproducibility. Furthermore, the chapter
emphasizes the importance of proper training for researchers and
introduces resources such as the Language Technology and Data Analysis
Laboratory, along with examples from the author’s own research on
intensifiers.
EVALUATION
This book is a specialized volume that presents cutting-edge
linguistic research employing data-intensive approaches. In each
chapter (Chapters 2–9), the authors draw on relatively large datasets
and apply a range of analytical methods to linguistic inquiry. Each
chapter demonstrates analytical tools and methods and presents the
results through well-designed and clearly structured visualizations,
making the volume a valuable source of insight for students and
researchers who wish to develop quantitative approaches to linguistic
research using programming languages (especially R) and statistical
software. Because it covers a wide range of linguistic features across
different subfields, it is relevant to students and scholars from a
broad range of disciplines. However, as a wide variety of statistical
terms and theories are used throughout, the content may be quite
challenging for beginners.
There are two notable features in this volume. The first is the
analytical rigor demonstrated in each chapter. The studies presented
in Chapters 2–9 employ a wide range of statistical and other
analytical tools, representing a clear methodological advancement over
earlier work that relied on simpler statistical measures (see the
discussion in Chapter 1). This advancement is particularly evident in
Chapter 2, where earlier quantitative analyses of phonetic
features—often based on measures such as the mean or mode—are replaced
by point pattern analysis, offering a more precise perspective on
distributional patterns. Similarly, Chapter 5 moves beyond
conventional frequency-based measures by incorporating distributional
semantics that take the meanings of constituent elements into account,
thereby enabling more effective identification of idiomatic
expressions. A further illustration of this methodological development
is the frequent application of regression analysis or similar analysis
throughout this volume to identify patterns of frequency and the
factors underlying the linguistic choices under investigation (see
Chapters 3, 4, 6, 7, and 8). Regression-based approaches are effective
in identifying the effects of multiple factors simultaneously,
avoiding analytical methods that treat factors in isolation and
thereby obscure their relative contributions, as is often the case in
traditional sociolinguistics (cf. Labov 1972). The use of regression
analysis in linguistic research is not entirely new, but recent
practices differ in important ways from earlier studies. Although
earlier approaches, such as those using GoldVarb
(http://individual.utoronto.ca/tagliamonte/goldvarb.html), also
considered the effects of multiple factors, they were subject to
methodological limitations, particularly in the analysis of
interaction effects (i.e., cases where the effect of one factor
depends on the value of another) (cf. Tagliamonte 2006). More recent
regression-based methods, as seen in this volume, have addressed many
of these issues by explicitly incorporating interaction terms through
modern statistical programming environments (e.g., R). As discussed in
Chapter 10, a proper understanding of these recent data-analytical
approaches requires a high level of methodological awareness on the
part of researchers. At the same time, the increasing sophistication
of analytical techniques represents a positive development for the
field, and this volume successfully captures this trend.
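To make the notion of an interaction effect concrete, the following
minimal sketch compares the effect of one factor (clause type) on the
log-odds of a variant within two speaker groups. The counts, factor
names, and helper functions are all invented for illustration and do
not come from any chapter of the volume. With two binary predictors,
the difference between the two within-group effects is the quantity
that an interaction term in a logistic regression estimates.

```python
import math


def log_odds(variant_a, variant_b):
    """Log-odds of choosing variant A over variant B."""
    return math.log(variant_a / variant_b)


# Hypothetical token counts (invented for illustration).
# Keys: (clause type, speaker group) -> (tokens of A, tokens of B)
counts = {
    ("main", "native"): (80, 20),
    ("subordinate", "native"): (50, 50),
    ("main", "learner"): (60, 40),
    ("subordinate", "learner"): (58, 42),
}


def clause_effect(group):
    """Change in log-odds of variant A from subordinate to main clauses."""
    main = log_odds(*counts[("main", group)])
    subordinate = log_odds(*counts[("subordinate", group)])
    return main - subordinate


effect_native = clause_effect("native")
effect_learner = clause_effect("learner")

# If clause type mattered equally in both groups, this would be near 0.
interaction = effect_native - effect_learner

print(f"clause effect (native):  {effect_native:.2f}")
print(f"clause effect (learner): {effect_learner:.2f}")
print(f"interaction:             {interaction:.2f}")
```

In this invented dataset the clause-type effect is much stronger for
one group than for the other, which is exactly the kind of pattern
that an analysis treating each factor in isolation would miss.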
Another notable characteristic of this volume is its emphasis on
theoretically grounded analysis within data-intensive research. In
Chapter 1, Mikko Laitinen, Paula Rautionaho, and Irene Taipale
emphasize that the studies included in the volume (Chapters 2–9) aim
to achieve a high level of analytical sophistication by drawing on
linguistic theories rather than relying solely on automated methods—an
approach that distinguishes linguistic research from purely
computational or NLP-driven studies. Strictly speaking, this approach
is not entirely new, as evidenced by multidimensional register
analyses since the 1980s (cf. Biber 1988). However, its importance
warrants renewed emphasis in contemporary English linguistics, given
the recent expansion in the use of large-scale datasets and
increasingly complex analytical techniques, particularly in
sociolinguistics and historical linguistics. Indeed, many of the
chapters investigate the factors underlying linguistic variation by
drawing on insights from previous studies and situating their analyses
within theoretical frameworks such as grammaticalization (Chapter 3),
the Conserving Effect (Chapter 4), and the Complexity Principle and
the Choice Principle (Chapter 8). This emphasis is crucial, as
analyses conducted in a purely automated manner often invite severe
criticism regarding their validity and overall quality—what may be
referred to as “data fetishism” (see Chapter 1).
Overall, I am in favor of the theory-driven, data-intensive approach
adopted in this volume, and I find the book highly valuable. From this
perspective, one key observation emerges: some contributors do not
rely exclusively on complex statistical methods; rather, they also
make use of simpler measures, including percentages and normalized
frequencies. In Chapter 3, advanced methods are employed primarily to
examine external factors, while simple methods are used for internal
factors. In other chapters, simple statistics are presented
alongside, or following, advanced analyses (see Chapters 6 and 7).
There are also instances in which simpler measures are used to examine
the frequency of individual forms, rather than overall trends (see
Chapter 4). In these studies, there appear to be different motivations
for using simple statistics. In some cases, the rationale is more
constrained or pragmatic. For example, authors may rely on simple
statistics for internal factors when the distribution is heavily
skewed toward one form (see Chapter 3). Similarly, there may be
insufficient data for each item to permit the use of more complex
statistical analyses (as may be the case in Chapter 4). In other
instances, however, the rationale is more positive. As seen in
Chapters 6 and 7, simpler statistics can serve as more accessible and
readily interpretable analytical tools.
In my view, when simple and more complex statistical methods yield the
same results, simpler methods can serve as a very useful means of
making the findings of more complex analyses more accessible. However,
when the two approaches produce different results, it becomes less
clear how the simpler statistics should be treated. A common situation
is one in which an effect observed in simple statistics is not
confirmed by more complex analyses. In such cases, scholars tend to
adopt different positions: some give precedence to the simpler results
(often attributing the failure to detect the effect in more complex
analyses to issues such as data limitations, including low frequencies
or register differences), while others prioritize the more complex
analyses and disregard the simpler findings. From a data-intensive
perspective, the latter interpretation is generally favored; however,
in linguistics, it is not uncommon to encounter positions that
privilege simpler statistical evidence. What would you say to those
who privilege simple statistics? This is a question I would like to
pose to the editors and contributors of the volume.
Finally, one potential area for improvement in this volume is its
methodological balance: one particular statistical method, regression
analysis, receives comparatively more attention than other advanced
techniques. The inclusion of cluster analysis or
comparable multivariate methods could further enrich the analytical
depth of the studies presented. Traditionally, cluster analysis has
been used to classify World Englishes (e.g., Werner 2013) or
linguistic registers (e.g., Biber & Egbert 2018; Zhang 2019); more
recently, it has been applied in sociolinguistics to group individuals
according to their behavioral patterns (e.g., Haddican et al. 2022;
Travis & Gan 2025). Because it tends to rely on large-scale data and
sophisticated statistical procedures (e.g., the elbow method, the
silhouette score, and the gap statistic), this approach aligns well
with the data-intensive perspective exemplified throughout the volume.
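As a brief illustration of one of the procedures mentioned above, the
sketch below computes the mean silhouette width for two alternative
groupings of the same speakers. It is a minimal sketch under stated
assumptions: the function name, the one-dimensional data (each
speaker's rate of use of some variant), and the cluster labels are all
hypothetical and invented for illustration. A higher score indicates
that speakers sit closer to their own group than to the nearest other
group.

```python
def silhouette_score(points, labels):
    """Mean silhouette width for 1-D points under a given labelling."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    widths = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so it adds nothing).
        a = sum(abs(point - q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(point - q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denominator = max(a, b)
        widths.append((b - a) / denominator if denominator > 0 else 0.0)
    return sum(widths) / len(widths)


# Hypothetical per-speaker rates of a variant (invented numbers).
rates = [0.10, 0.12, 0.15, 0.80, 0.85, 0.90]

two_groups = silhouette_score(rates, [0, 0, 0, 1, 1, 1])
three_groups = silhouette_score(rates, [0, 0, 1, 2, 2, 2])

print(f"two clusters:   {two_groups:.3f}")
print(f"three clusters: {three_groups:.3f}")
```

For these invented data the two-group solution scores higher, so a
researcher using this criterion would prefer two speaker groups over
three.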
REFERENCES
Biber, Douglas. 1988. Variation across speech and writing. Cambridge:
Cambridge University Press.
Biber, Douglas & Jesse Egbert. 2018. Register variation online.
Cambridge: Cambridge University Press.
Haddican, Bill, Cecelia Cutler, Michael Newman & Christina Tortora.
2022. Cross-speaker covariation across six vocalic changes in New York
English. American Speech 97(4). 512–542.
Labov, William. 1972. Sociolinguistic patterns. Philadelphia:
University of Pennsylvania Press.
Tagliamonte, Sali A. 2006. Analysing sociolinguistic variation, 1st
edn. Cambridge: Cambridge University Press.
Travis, Catherine E. & Qiao Gan. 2025. The intersection of ethnicity
and social class in language variation and change. Language Variation
and Change 37. 137–160.
Werner, Valentin. 2013. Temporal adverbials and the present
perfect/past tense alternation. English World-Wide 34(2). 202–240.
Zhang, Man. 2019. Exploring personal metadiscourse markers across
speech and writing using cluster analysis. Journal of Quantitative
Linguistics 26(4). 267–286.
ABOUT THE REVIEWER
Ayano Shigeta-Watanabe earned her PhD from the University of Sheffield
in 2023 and is currently a part-time lecturer at Kobe University,
Japan. Her research interests are twofold. The first is language
variation and change in spoken English, with a particular focus on the
interplay between internal and external factors. The second is
identity construction in pop music, which was the topic of her PhD
thesis. Her most recent work (2025) examined variation in third-person
“don’t” (e.g., “he don’t work”) in British English, using logistic
regression analysis to investigate contributing factors.
NOTE
In preparing this book review, I used ChatGPT to assist with
grammatical revision.


