36.1285, Reviews: Linguistics across Disciplinary Borders: Steven Coats, Veronika Laippala (eds.) (2024)
The LINGUIST List
linguist at listserv.linguistlist.org
Thu Apr 17 12:05:02 UTC 2025
LINGUIST List: Vol-36-1285. Thu Apr 17 2025. ISSN: 1069 - 4875.
Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Justin Fuller
Team: Helen Aristar-Dry, Steven Franks, Joel Jenkins, Daniel Swanson, Erin Steitz
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org
Homepage: http://linguistlist.org
Editor for this issue: Joel Jenkins <joel at linguistlist.org>
================================================================
Date: 16-Apr-2025
From: Troy E Spier [tspier2 at gmail.com]
Subject: Anthropological Linguistics, Computational Linguistics, General Linguistics, Sociolinguistics: Steven Coats, Veronika Laippala (eds.) (2024)
Book announced at https://linguistlist.org/issues/35-2426
Title: Linguistics across Disciplinary Borders
Subtitle: The March of Data
Series Title: Language, Data Science and Digital Humanities
Publication Year: 2024
Publisher: Bloomsbury Publishing
http://www.bloomsbury.com/uk/
Book URL:
https://www.bloomsbury.com/linguistics-across-disciplinary-borders-9781350362260/
Editor(s): Steven Coats, Veronika Laippala
Reviewer: Troy E Spier
SUMMARY
In an oft-referenced post on Twitter (now ‘X’) approximately one
decade ago, Dan Ariely analogized big data to teenage sex, remarking
that “everyone talks about it, nobody really knows how to do it,
everyone thinks everyone else is doing it, so everyone claims they are
doing it” (qtd. in Reis and Housley, 2023, p. 8). In the case of
Steven Coats and Veronika Laippala’s edited volume, readers are
offered the opportunity to engage with methodological and theoretical
concerns in corpus linguistics without simply ‘following the pack.’
Containing eight formal chapters, forty-five figures, thirty-six
tables, and extensive references, this volume presents analyses of
real-life data through case studies that address stylistic and
discursive concerns.
Despite being the shortest chapter of this volume, the introduction
informs the reader that significant sources of data can be found
through digital media. This refers, in some instances, to finite data,
such as English-language literature published over the course of a
single century, or to potentially infinite data in other instances,
such as content written for and shared on community forums like Reddit
and social media services like Twitter/X. Additionally, the
introduction cursorily describes some of the commonly used approaches
to corpus data (e.g. lexical dispersion and concordancing) and some of
the issues impacting the ready-usability of such data due, for
instance, to the formatting and retrievability of content on the
internet.
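For readers unfamiliar with the two approaches named above, the
following is a minimal sketch in Python, not the volume's own code.
The tokenizer, the function names, and the sample sentence are all
assumptions made for illustration, and dispersion is reduced here to
its simplest form: the number of equal-sized corpus segments in which
a word occurs.

```python
import re

def tokenize(text):
    """Lowercase a text and split it into word tokens (a deliberately simple scheme)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def kwic(tokens, keyword, window=3):
    """Keyword-in-context: up to `window` tokens on either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

def dispersion_range(tokens, keyword, parts=4):
    """Count how many of (at most) `parts` equal-sized segments contain the keyword."""
    size = max(1, -(-len(tokens) // parts))  # ceiling division
    segments = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    return sum(keyword in seg for seg in segments)

# Hypothetical sample text, invented for this illustration.
sample = "The data are large. Big data are messy, and data alone prove little."
toks = tokenize(sample)

for left, kw, right in kwic(toks, "data", window=2):
    print(f"{left:>12} [{kw}] {right}")

print(dispersion_range(toks, "data", parts=4))
```

Dedicated tools report more refined dispersion measures; the range
count above only conveys the underlying idea that frequency alone
does not show how evenly a word is spread through a corpus.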
Chapter 1 opens with a discussion of the merits of automatic speech
recognition. In eschewing a purely text-based approach, the author
acknowledges the rich source of data available on websites like
YouTube, noting for readers that such data are now readily available
for computational analysis through either manual transcription by a
human or automatic transcription by a computer. Before
differentiating the transcriptional accuracy and
Heimatland-identification of speakers in data from corpora covering
four areas (North America, British Isles, Australia/New Zealand,
Germany), the author explains the legal and programmatic
considerations for readers in their potential acquisition of such
data.
Chapter 2 approaches corpus-based research in a fundamentally
different way than many introductory textbooks do. In contrast to
lower-level, programmatic approaches in languages like Python and R or
more dedicated, statistical approaches in tools like SPSS, the
Konstanz Information Miner (KNIME) is presented as an alternative for
linguists. By enabling them to visualize each step of the analytical
process in a transparent manner, KNIME prevents scholars from making
the mistakes that the authors acknowledge are commonplace in
interdisciplinary research. In other words, in the same way that not
all computational scientists or statisticians will have a firm grasp
of language usage, not all linguists will have a firm grasp of
computational or statistical approaches. To this end, this chapter
exemplifies KNIME through two case studies on markedly different data
sets to illustrate for readers what this process could look like,
careful also to admit the potential shortcomings in the employment of
such a tool.
Chapter 3 begins by noting the inherently interdisciplinary nature of
corpus studies, remarking that the field has increasingly become
intimately intertwined with data science and statistics. Relying on
data from the Corpus of Historical American English (COHA), this
chapter examines the feasibility of supervised document
classification, topic modelling, distributional semantics, and
conceptual maps. Of particular importance are the visual
representations of the data, demonstrating the relationship between
and among other lexical items; and the tabular representations of
topic assignment, both computationally and manually as a way of
comparing accuracy.
Chapter 4 engages existing scholarship on fundamental terminology
(viz. ‘genre’ vs. ‘register’) and the usage of Optical Character
Recognition (OCR) in digitizing texts that will later be used for deep
learning projects. In doing so, the authors establish the foundation
for their automatic register identification, relying on a significant
collection of texts in the Corpus of Founding Era American English
(COFEA) before using BERT to train and tag these texts automatically
on the basis of particular text types, e.g. letters, speeches,
treaties, and essays. Finally, the authors note that such techniques,
while useful in their own right for individual projects, more
importantly can facilitate the creation of new corpora more
consistently.
Chapter 5 expands the conversation in the preceding chapter by
applying a similar approach to a non-Indo-European language. By
consulting the Finnish Internet Parsebank and FinCORE, this chapter
examines whether real-world language usage on the internet can be
reliably classified into registers or subregisters on the basis of
prototypical linguistic characteristics. In this case, though, the
authors had to begin with the annotation process, as pre-existing
materials were not otherwise available. Additionally, they looked at
lemmas, not individual words, in considering the five topics with the
highest probability of reliable classification, noting empirically,
though many know this intuitively, that greater topic diversity often
results in lesser cohesion.
Chapter 6 embarks on a fundamentally different course of action from
the previous chapters. Beginning with a brief introduction to the
Corpus of Global Web-Based English (GloWbE) and the International
Corpus of English (ICE), the author interrogates the notion that
bigger is always better, noting that large corpora sometimes contain
‘messy’ data that obscure what is truly at hand. As such, this chapter
first considers the existing criticism of large corpora before
suggesting that the ICE, as a collection of smaller, more manageable
corpora, might be used to make more reasonable assessments of the
state and prototypical features of international varieties of English.
This assumption is tested, quite successfully, on a number of
morphosyntactic attributes like nominal suffixation, the presence of
particular pronouns, and phrasal coordination; orthographical
attributes, such as the presence of contractions, hyphenated words,
and URLs; and statistical attributes, such as the percentage of
(non-)standard words and mean word length.
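Several of the orthographical and statistical attributes listed above
can be approximated with a few regular expressions. The sketch below
is an assumption-laden illustration in Python rather than the
chapter's actual procedure: the patterns, the function name, and the
sample sentence are invented here, and the contraction pattern will
also count possessives such as "John's".

```python
import re

# Illustrative patterns only; a real study would define these more carefully.
URL_RE = re.compile(r"https?://\S+")
CONTRACTION_RE = re.compile(r"\b\w+'\w+\b")      # also matches possessives
HYPHENATED_RE = re.compile(r"\b\w+(?:-\w+)+\b")
WORD_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")

def surface_features(text):
    """Return simple per-text counts and the mean word length."""
    urls = URL_RE.findall(text)
    stripped = URL_RE.sub(" ", text)  # keep URLs out of the word counts
    words = WORD_RE.findall(stripped)
    mean_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        "urls": len(urls),
        "contractions": len(CONTRACTION_RE.findall(stripped)),
        "hyphenated": len(HYPHENATED_RE.findall(stripped)),
        "mean_word_length": mean_len,
    }

# Hypothetical sample sentence using straight apostrophes.
sample = "It's a well-known fact, isn't it? See https://example.org/a-b for more."
features = surface_features(sample)
print(features)
```

Per-text values of this kind could then be compared across the
components of a corpus, which is the general kind of comparison the
chapter is described as making.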
Chapter 7, in stark contrast to the technologically and
statistically intensive chapters in this volume, presents a
back-to-the-basics approach or, as the authors describe it, a “common
sense approach.” By considering three hundred top-level posts on
Twitter (now ‘X’) that include the phrase “working from home,” the
authors are able to demonstrate how tools like AntConc and FireAnt,
which are admittedly more familiar to corpus linguists, can lead to
insightful conclusions about the data. More than this, the authors
employ a spreadsheet to keep track of themes, subthemes, the inclusion
of textual and extratextual information (e.g. images and links), and a
manually-determined sentiment analysis (positive, negative, mixed,
objective). Finally, this chapter includes an incredibly important
discussion surrounding the ethics of including social media data in
research projects.
Chapter 8 considers the important relationship between sex and gender
identity by collating almost forty-four million tokens in posts and
comments on Reddit that include the [IDENTIFY as X] construction.
Noisy data are excluded, such as “[...] identified as a risk factor
for several major cardiovascular diseases” (p. 213). Next, keyword
analysis and concordancing are undertaken to determine which subjects
and complements are invoked for which purpose, e.g. self-positioning,
giving advice, and debating. Relevant exemplars are provided for each
of these pragmatic functions.
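To illustrate why noisy hits need excluding, the following Python
sketch searches for the [IDENTIFY as X] construction with a single
regular expression. The pattern, the function name, and the sample
sentences are assumptions made for this illustration; the review does
not describe how the chapter's authors retrieved or filtered their
data, so no filtering step is attempted here.

```python
import re

# Hypothetical pattern: any inflected form of IDENTIFY, then "as", then
# an optional article and one following word as the complement.
IDENTIFY_AS = re.compile(
    r"\b(identify|identifies|identified|identifying)\s+as\s+((?:an?\s+)?[\w-]+)",
    re.IGNORECASE,
)

def find_identify_as(text):
    """Return (verb form, complement) pairs for every match in the text."""
    return [(m.group(1).lower(), m.group(2).lower())
            for m in IDENTIFY_AS.finditer(text)]

# Invented examples; only the last echoes the noisy hit quoted in the review.
posts = [
    "I identify as nonbinary.",
    "She identifies as a woman.",
    "Obesity was identified as a risk factor for several diseases.",
]
for post in posts:
    print(find_identify_as(post))
```

The third sentence matches just as readily as the first two, which
shows that a purely formal search over-generates and that hits of the
medical kind quoted above must be removed by some further step.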
EVALUATION
“Linguistics Across Disciplinary Borders: The March of Data” is a
fascinating, compact volume that unifies a number of authors with the
same shared goal: to collect, organize, and utilize data more
effectively in drawing conclusions about who we are and how language
can be used toward specific ends. The breadth of corpora discussed and
utilized is quite significant, as is the extensive inclusion of
references, visualizations, and tables for each chapter. Similarly, an
emphasis upon ‘low-tech’ and ‘high-tech’ tools and approaches was
especially strong. However, a few organizational modifications would
have made this volume more effective.
First, the sixth, seventh, and eighth chapters are, by far,
the most accessible, but they are positioned ineffectively. Had these
appeared at the beginning of the book, for instance, readers would not
immediately confront the technological and statistical terminology
that, as the introduction suggests (though not forcefully enough), is
often prone to abuse: because computers are employed and statistics
are invoked by many scholars as a means to an end without sufficient
understanding of either, this may result in an overemphasis of
p-values or in the assumption that one tool is best. Likewise, the
second chapter presents a streamlined path toward analysis and
acknowledges stopwords, but at least a footnote explaining the
benefits and drawbacks of stopword (in/ex)clusion would have been
welcome here. Instead, readers must first engage with deep learning
and automated text classification before learning about or being
reminded of basic methodological and ethical issues in corpus creation
and the most commonplace tools used by those in the field who may not
have the prerequisite computational background of some of the authors.
Restructuring in this regard would allow for a more natural,
scaffolded approach for understanding—from concrete to abstract, from
manual to computational, from descriptive to inferential.
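The point about stopwords can be made concrete with a short Python
sketch. The stopword list below is a tiny, hypothetical one, and the
sample sentence is invented; neither comes from the volume. The same
text yields very different frequency lists depending on whether
stopwords are kept.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "from"}

def top_words(text, n=3, remove_stopwords=False):
    """Return the n most frequent tokens, optionally dropping stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return Counter(tokens).most_common(n)

sample = ("The corpus is the record of the language, "
          "and the language is in the corpus.")

print(top_words(sample, n=3))                         # dominated by function words
print(top_words(sample, n=3, remove_stopwords=True))  # content words surface
```

Removing stopwords brings content words to the top, which suits
topic-oriented questions, but it also discards exactly the function
words on which many stylistic and register analyses depend.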
Second, a number of chapters utilize the term ‘register’ in a
way that, while aligning with its usage in some subfields of
linguistics, diverges from how it is understood within related fields.
For example, what is defined as a ‘register’ here would be a ‘genre’
for rhetoricians and literary scholars or a ‘text type’ for discourse
analysts, while a sociolinguist’s definition of ‘register’ would
differ from them all. While the fourth chapter does an effective job
at defining the terminology for the book’s intended audience, it does
not acknowledge that scholars in adjacent disciplines may interpret
these terms differently (see e.g. Lee 2001, Baker and Ellece 2011,
Wales 2011, and Baldick 2015); thus, although the same chapter does
recognize that even linguists sometimes treat these terms as
synonymous, this straightforward ‘collapsing’ runs the risk of
oversimplifying otherwise complex, field-specific conventions
surrounding nomenclature.
Finally, the title doesn’t seem to reflect the contents: Which
disciplinary borders have been crossed? Each chapter offers a
decidedly interdisciplinary perspective on the data, but the
introduction already clarifies that “researchers working in various
disciplines [...] utilize linguistic material to study how information
is transmitted and spread around the world” (p. 1). The existing title
suggests that this text would be analogous to extant scholarship like
Karsdorp et al. (2021). Nonetheless, based on each autobiographical
blurb printed, almost every author is either an applied linguist or a
computational linguist. Given the exciting borders that this volume
actually crosses, a name like “Corpus Linguistics Across Time, Space,
and Place: The March of Data” might have better captured the journey
that these authors undertake through consideration of historical
language usage, multimodal discourse analysis, international varieties
of English, computer-assisted and manual annotation, ‘old school’ vs.
‘new school’ tools, and more.
REFERENCES
Baker, Paul, and Sibonile Ellece. 2011. Key Terms in Discourse
Analysis. London, UK: Continuum International Publishing Group.
Baldick, Chris. 2015. The Oxford Dictionary of Literary Terms. Oxford,
UK: Oxford University Press.
Karsdorp, Folgert et al. 2021. Humanities Data Analysis: Case Studies
with Python. Princeton, NJ: Princeton University Press.
Lee, David Y.W. 2001. Genres, Registers, Text Types, Domains, and
Styles: Clarifying the Concepts and Navigating a Path Through the BNC
Jungle. Language Learning & Technology, 5(3): 37-72.
Reis, Joe and Matt Housley. 2023. Fundamentals of Data Engineering:
Plan and Build Robust Data Systems. Sebastopol, CA: O'Reilly Media,
Inc.
Wales, Katie. 2011. A Dictionary of Stylistics. New York, NY:
Routledge.
ABOUT THE REVIEWER
Troy E. Spier is Assistant Professor of English and Linguistics at
Florida A&M University. He earned his MA and Ph.D. in Linguistics at
Tulane University, his B.S.Ed. in English/Secondary Education at
Kutztown University, and a graduate certificate in Islamic Studies at
Dallas International University. His research interests include
language documentation and description, discourse analysis, corpus
linguistics, and linguistic landscapes.