36.2213, Reviews: Deep Learning for Natural Language Processing: Mihai Surdeanu, Marco Antonio Valenzuela-Escárcega (2024)



LINGUIST List: Vol-36-2213. Sat Jul 19 2025. ISSN: 1069 - 4875.

Subject: 36.2213, Reviews: Deep Learning for Natural Language Processing: Mihai Surdeanu, Marco Antonio Valenzuela-Escárcega (2024)

Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Valeriia Vyshnevetska
Team: Helen Aristar-Dry, Mara Baccaro, Daniel Swanson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Editor for this issue: Helen Aristar-Dry <hdry at linguistlist.org>

================================================================


Date: 19-Jul-2025
From: Michael B. Maxwell [mmaxwell at umd.edu]
Subject: Mihai Surdeanu, Marco Antonio Valenzuela-Escárcega (2024)


Book announced at https://linguistlist.org/issues/36-147

Title: Deep Learning for Natural Language Processing
Subtitle: A Gentle Introduction
Publication Year: 2024

Publisher: Cambridge University Press
           http://www.cambridge.org/linguistics
Book URL: https://cambridge.org/9781009012652

Author(s): Mihai Surdeanu, Marco Antonio Valenzuela-Escárcega

Reviewer: Michael B. Maxwell

SUMMARY
This book describes Natural Language Processing (NLP) as practiced
around 2023, namely by means of "deep learning" (where "deep" refers
to the number of layers in the computer implementation of a neural
network, not to any abstract notion of deep knowledge).  There are
other approaches to NLP, but as the authors say, deep learning is the
current favorite, and it is likely to remain so for the near future.
The preface lays out the aim of the book: "to bridge the theoretical
and practical aspects of deep learning for natural language
processing" (p. xvii)--unlike other works, which the authors say
concentrate on either the theory behind deep learning, or the
practical issues of using machine learning software.  The authors list
the prerequisites that the reader should already know: linear algebra,
differential calculus, probability, and the Python programming
language.  Extensive knowledge of these fields is not required--a
single semester in each of these topics should suffice.  It also helps
to have some experience with Jupyter notebooks, but this can be easily
picked up.  (See my "Computational Notes" at the end of this review
for some other considerations.)
The first chapter outlines the remaining chapters, but also says what
the book does not cover: evolutionary (genetic) algorithms, symbolic
algorithms, Bayesian approaches, and analogic reasoning.  And while
deep learning has been used for many applications, the book only
describes its use for NLP.  The authors also explicitly lay out
drawbacks of deep learning: it is opaque (you can't easily look at the
learned networks to see how they work), it can be brittle, and it
lacks common sense.
The rest of the book roughly alternates between chapters describing
the mathematics of neural network approaches, and chapters giving the
computational implementation of each approach using Python and machine
learning libraries.  The theoretical chapters end with sections on
drawbacks of the methods described in the chapter (frequently pointing
to later chapters where other methods will address those drawbacks),
the historical background of the methods, and references and
additional readings; all chapters end in a short chapter summary.  The
code, including occasional bug fixes, can be downloaded as Jupyter
files from the authors' website.
The chapters after the first chapter are as follows:
2: Perceptrons: these are the simplest neural network architecture.
Perceptrons are not actually deep learning architectures themselves,
but the exposition serves to familiarize the reader with the chief
component of neural networks, as well as the linear algebra underlying the
more sophisticated approaches.  The chapter also briefly introduces
evaluation measures for binary outputs (yes or no for a given data
point).
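For readers who think in code, the basic (non-averaged) perceptron
fits in a few lines.  The sketch below is mine, not the book's code;
it assumes NumPy, a feature matrix X, and labels y equal to +1 or -1.

    import numpy as np

    def train_perceptron(X, y, epochs=10):
        # X: (n, d) feature matrix; y: labels of +1 or -1
        w = np.zeros(X.shape[1])   # weight vector (the learned hyperplane)
        b = 0.0                    # bias term
        for _ in range(epochs):
            for x_i, y_i in zip(X, y):
                y_hat = 1 if np.dot(w, x_i) + b > 0 else -1   # discrete decision
                if y_hat != y_i:
                    # on a mistake, shift the hyperplane toward this point
                    w = w + y_i * x_i
                    b = b + y_i
        return w, b
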
3: Logistic regression: whereas perceptrons use a discrete decision
function to assign a value to a data point during training, logistic
regression replaces this with a smooth (continuous) update function:
when the machine learning algorithm makes a decision in training that
is not perfectly right or perfectly wrong, the logistic regression
algorithm makes a small change to the decision function.  This chapter
also introduces cost functions, which measure how far off a predicted
value is for a given data point, and gradient descent, which
incrementally adjusts the model's weights as the training data is
processed.  (The math is slightly
hairy, and the authors say the less mathematical reader may choose to
skip forward a few sections.)  Finally, the chapter introduces
multiclass decision problems--where there are more than two possible
values--and their evaluation metrics.
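Again as my own illustration rather than the book's code, the
gradient descent loop for binary logistic regression is equally
short; it assumes NumPy, a learning rate lr, and labels of 0 or 1.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_logistic_regression(X, y, lr=0.1, epochs=100):
        # X: (n, d) feature matrix; y: labels of 0 or 1
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(epochs):
            p = sigmoid(X @ w + b)               # predicted probabilities
            error = p - y                        # gradient of the cross-entropy cost
            w = w - lr * (X.T @ error) / len(y)  # small, smooth weight update
            b = b - lr * error.mean()
        return w, b
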
4: This chapter is an implementation chapter (the previous two were
theory chapters), in which the perceptron and logistic regression
algorithms are each used for a text classification problem.  The
perceptron implementation performs binary classification (assigning
good/bad ratings for movie reviews), while a couple of different
versions of the logistic regression algorithm are tested on both
binary and
multiclass classification tasks (the latter illustrated by the
assignment of one of several topic labels to news articles).
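Purely for orientation, here is how the pieces might fit together on
a toy binary task, reusing the train_perceptron sketch given earlier;
the tiny bag-of-words data below is my stand-in, not the book's
dataset.

    import numpy as np

    # toy stand-in for a movie-review dataset: bag-of-words counts over the
    # vocabulary ["wonderful", "moving", "dull", "boring"]
    X = np.array([[1, 1, 0, 0],    # "a wonderful, moving film"
                  [0, 0, 1, 1],    # "dull and boring"
                  [0, 1, 0, 0],    # "a moving story"
                  [0, 0, 0, 1]])   # "boring throughout"
    y = np.array([1, -1, 1, -1])   # 1 = good review, -1 = bad review

    w, b = train_perceptron(X, y)  # defined in the earlier sketch
    # classify a new review containing only the word "wonderful"
    print(np.dot(w, np.array([1, 0, 0, 0])) + b > 0)
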
5--7: Feed-Forward Neural Networks use back-propagation in multi-layer
(hence "deep") networks to learn non-linear decision boundaries,
something which the algorithms discussed in previous chapters are
incapable of.  (The notion of non-linear boundaries is difficult to
describe in words, but the authors illustrate it with diagrams.)
Chapter 5 provides the theory (the math is again slightly hairy),
while Chapter 7 (short, since a machine learning library handles the
hairy math) gives the implementation.  The intervening Chapter 6,
entitled "Best Practices in Machine Learning", addresses the perennial
issue that while in theory, theory and practice are the same, in
practice they are not.  The difficulties for deep neural networks
include issues such as (potentially) slow convergence to an answer,
and overfitting, that is, adhering too closely to random quirks of
the training data.  Since these
problems arise more on the practical than the theoretical side, the
present state of the art does not provide theoretical solutions.
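To give a sense of what such a network looks like in code, here is a
minimal PyTorch sketch of my own (the layer sizes are arbitrary); the
non-linear activation between the two layers is what makes non-linear
decision boundaries learnable.

    import torch
    import torch.nn as nn

    # a two-layer feed-forward network with a ReLU non-linearity in between
    model = nn.Sequential(
        nn.Linear(100, 32),   # 100 input features -> 32 hidden units
        nn.ReLU(),
        nn.Linear(32, 2),     # 32 hidden units -> 2 output classes
    )
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # one training step on a made-up batch of 8 examples
    x = torch.randn(8, 100)
    y = torch.randint(0, 2, (8,))
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()    # back-propagation computes the gradients
    optimizer.step()   # gradient descent nudges the weights
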
8--9: Chapter 8 introduces the Word2vec algorithm, which creates a
semantic-ish representation of words in a multidimensional vector
space (typically a few hundred dimensions).  This is actually one of
the more linguistic-y ideas used in natural language processing,
dating back to observations by Harris (1954) and Firth that "You shall
know a word by the company it keeps!" (Firth 1957: 11).  That idea
only became practical to implement with the recent advent of large
corpora and computational methods to process those corpora.  (It must
also be said that this gets at certain aspects of word meaning, like
synonymy, but it is unclear how it works above the word level.)
The ninth chapter uses pre-trained vector representations
("embeddings") to illustrate uses of this technology.
10--11: These chapters introduce the theory of Recurrent Neural
Networks (RNNs), and apply them to the task of part of speech (POS)
tagging.  While the techniques described in earlier chapters exhibit
limited sensitivity to context, RNNs make it possible to explicitly
take into account the actual sequence of words.  This is not of course
the same as building a phrase structure parse of a sequence of words,
although it can be seen as a precursor to that task (since grammatical
categories of words must be known in order to construct a phrase
structure that includes those words).
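As a rough sketch of the architecture (not of the book's
implementation), an RNN tagger pairs an embedding layer with a
recurrent layer, here an LSTM, and a per-token classifier; the sizes
below are arbitrary placeholders.

    import torch
    import torch.nn as nn

    class RNNTagger(nn.Module):
        # skeleton of a recurrent part-of-speech tagger: one tag per token
        def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.classify = nn.Linear(hidden_dim, num_tags)

        def forward(self, token_ids):                    # (batch, sentence_length)
            states, _ = self.rnn(self.embed(token_ids))  # one hidden state per token
            return self.classify(states)                 # per-token tag scores

    tagger = RNNTagger(vocab_size=10000, num_tags=17)
    scores = tagger(torch.randint(0, 10000, (1, 6)))  # scores for a 6-token sentence
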
12--13: Chapter 12 introduces the theory of Transformer Networks,
which in advanced forms (and with huge amounts of computing) underlie
Large Language Models (LLMs).  While Transformer Networks bear some
resemblance to algorithms like Word2vec, they differ in other ways.
For instance, embeddings (semantic representations) are
contextualized relative to the surrounding words, which allows
modeling polysemy.  Transformer Networks also operate on tokens,
which are often pieces of words.  Linguists will immediately think of
morphemes, but in fact these tokens are derived automatically (as
described in the chapter), and may or may not be morphemes.
The computational implementation of Transformer Networks in Chapter 13
uses a pre-built model to improve on the text classification and part
of speech taggers developed earlier.
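The point about tokens is easy to see with any off-the-shelf subword
tokenizer.  The example below is mine, using the Hugging Face
transformers library and a BERT tokenizer, which may or may not be
what the book uses.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    # rare words are split into automatically learned subword pieces (marked
    # with '##'); the pieces may or may not coincide with morphemes
    print(tokenizer.tokenize("untranslatability"))
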
14--15: The sequence-to-sequence programs so far (such as the part of
speech tagger) have produced output elements in the same order as the
corresponding input tokens.  For obvious reasons,
applications such as machine translation need to output tokens in a
different sequence from the input sequence.  Encoder-decoder
approaches, the topic of these two chapters, address that issue.  As
usual, the first chapter discusses the theory, while the second walks
the reader through the use of a pre-trained deep neural net machine
translation system, followed by fine-tuning of that model for use in
the reverse translation direction.
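For a taste of the pre-trained route (again my own sketch, not the
book's code), the Hugging Face pipeline interface will run a small
pre-trained translation model in a few lines; the model named below
is simply a freely available one.

    from transformers import pipeline

    # load a small pre-trained English-to-French translation model;
    # fine-tuning it on new data is a separate, longer step
    translator = pipeline("translation_en_to_fr", model="t5-small")
    print(translator("Deep learning now dominates natural language processing."))
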
16: The final chapter briefly surveys several applications of deep
neural nets to natural language processing, some covered in previous
chapters and some not: text classification, part of speech tagging,
named entity recognition, dependency parsing, relation extraction
(such as finding owners of an organization in texts), question
answering, and machine translation.
The first appendix provides a brief overview of the Python programming
language, probably best suited to someone who has learned another
programming language, or who has taken a Python class but forgotten
the details.  The second appendix briefly discusses character encoding
schemes, especially Unicode.
EVALUATION
Who should read this?  This is a book about Natural Language
Processing (NLP), not computational linguistics (or any kind of
linguistics)--that is, about engineering, not science, at least as
linguists understand that term.  So whether you should read this book
depends on what you want to do with your career (or your free time).
If you want to describe under-described languages with sparse textual
data, this book may not be for you, since deep learning methods require
large amounts of data.  But mostly the book is for those who want to
learn how NLP is done currently, and it explains this well.
The book is available (for pay) as a PDF on the publisher's website,
one file per chapter--meaning that you can't search all the chapters
at once.  The website offers to zip up multiple chapters; however,
when I asked for more than a few chapters, I got broken zip files.
Fortunately, one can download one chapter at a time.
While explanations of algorithms are reasonably clear, they are brief,
and the reader may be unclear why some decisions were made.  For
example, in the second chapter, the average perceptron algorithm is
given (in pseudocode).  This algorithm creates a binary classifier by
training on a set of data points until it makes correct predictions.
When it makes an incorrect prediction, the algorithm must update its
internal vector (representing a number of weights, and constituting a
hyperplane cutting data points into two subsets) using this data
point.  It also averages the current vector with a sort of running
total vector, and this averaged vector is what is output by the
algorithm at the end of training.  But what is unclear in the text is
why, when the perceptron has made an incorrect decision, the current
vector appears to be averaged into the running total vector *before*
it is corrected to take account of that data point.  A student can ask
the instructor, but the solitary reader may be left wondering.
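To make the issue concrete, here is a schematic averaged perceptron
of my own devising; it is not the book's pseudocode.  In this
rendering the running total is accumulated after any correction,
which is the ordering I expected; the book's pseudocode appears to
accumulate it before the correction, and that is the step that left
me wondering.

    import numpy as np

    def train_averaged_perceptron(X, y, epochs=10):
        # X: (n, d) feature matrix; y: labels of +1 or -1
        w = np.zeros(X.shape[1])      # current weight vector
        total = np.zeros(X.shape[1])  # running total of weight vectors
        count = 0
        for _ in range(epochs):
            for x_i, y_i in zip(X, y):
                if y_i * np.dot(w, x_i) <= 0:  # incorrect (or marginal) prediction
                    w = w + y_i * x_i          # correct the current vector first...
                total = total + w              # ...then fold it into the running total
                count += 1
        return total / count                   # the averaged vector is the output
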
Some topics could not be covered, lest the book become huge: dialog,
multimodal data, speech recognition and production.  Also not covered
are topics that might be more career-relevant to linguists, such as
annotation, data cleaning, and synthetic data.  Speaking of data
cleaning, one of the datasets used for the text classification
programs contains news articles each annotated for one of eight
topics.  My inspection of this data showed it to be noisy in several
ways:
1) While most of the texts were written in English, some were in other
languages, which constitute noise for the algorithm.
2) Some texts contain non-Unicode characters, presumably in a
pre-Unicode encoding.  One could guess the encoding and convert these
texts to Unicode, but it was easier just to omit them.
3) Each text was tagged for a single topic.  In some cases, the same
article appeared more than once with different metadata (e.g.
attributed to different news agencies).  When this happened, the same
article was sometimes assigned to different topics.
4) The dataset's annotation was described in a published paper as
mostly relying on the categorization at the news websites where the
texts were found.  Since there were multiple news sites, there is no
guarantee that these different sites categorized articles in the same
way, or even used the same full set of categories.  Moreover, the
paper said that if a news site did not categorize an article, a topic
was assigned using a Bayesian classifier, adding yet another
potential source of inconsistent categorization.
There are techniques for ensuring accurate annotation, and there are
datasets suitable for text classification that are likely more
consistent than the dataset used.  That said, since the purpose of
this book is to illuminate deep learning, the drawbacks of the corpus
are not that important.
As with any book on Artificial Intelligence, some topics will
inevitably become outdated, and quickly at that; ChatGPT does not make an
appearance.  There is no help for that, but this book will give you
much of the background needed to understand recent developments, which
are firmly in the deep learning realm.
Typos are few.  Some are in cross-references among sections, and some
are in bibliographic citations; the errors appear in both the print
and PDF versions (surprising, since these were produced with LaTeX).
URLs pointing to pages on the Internet are sometimes broken--although
given how often web pages are moved without notice, this is hardly
surprising.  There are also some typos in the code samples, as well as
things that don't work due to changes in Python libraries.  I shared
the glitches I found with the authors, who have an errata page on
their website.  That website (the URL is given in the book) also links
to an updated PDF (of the entire book, not just each chapter
individually), where these glitches are corrected.  The website also
links to corrected Jupyter notebooks constituting working versions of
the code in each implementation chapter.  Without the book's
corrected PDF and the Jupyter listings, some things will appear
mysterious.  For
example, a variable 'device' appears to be used without being defined
in section 4.2.3 of the print version, but is defined in the book's
PDF on the authors' website and in the corresponding Jupyter notebook.
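For readers stuck at that point: in PyTorch code the missing line is
typically the standard idiom below, though I cannot promise that the
book's own definition is identical.

    import torch

    # use a GPU if one is available, otherwise fall back to the CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
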
I have never mentioned a book's typography in a review, but there is a
first for everything.  While the typesetting is excellent, the size of
the book (about nine by six inches) has resulted in tiny characters in
some of the formulas, particularly subscripts and superscripts.  This
problem also arises in the labels in some figures, where my old eyes
needed a magnifying glass.  Of course this problem can be overcome by
referring to the PDF versions, since PDF readers allow enlargement.
If the book is reprinted, I would encourage the publisher to use
larger pages.
I have not taught NLP, but I believe the book would serve as a fine
one-semester introductory text.  While it does not include exercises,
assigning simple or extensive modifications to its Jupyter-based
programs would be easy enough.  Side topics could also be added--for
instance, data cleaning (as described above), or implementing
applications briefly discussed in the final chapter.  There are also
many "References and Further Readings" at the end of each theory
chapter that could be used for discussions in more advanced classes.
COMPUTATIONAL NOTES
This section contains notes I made while running the code provided by
the authors.  It is not a review per se, but it may assist others who
want to use the code.
First, if you are experienced in Python, you can probably work through
the book's examples without much trouble.  If not, you should take at
least one course in Python, since you stand a good chance of running
into problems; of course if you are taking a class, the instructor
should help you work around these.  Such problems are not the fault
of the book's authors, but they can be surprising.  For example, I was
running an exercise from Chapter 9 in which a library was to be
imported from 'gensim.models'.  The 'import' command was not at the
top of the Jupyter file, so I had already run part of that file when I
encountered the error saying the library wasn't installed.  I
pip-installed the library, then tried to re-run that Jupyter cell--but
the new library failed to load, with an obscure error message.  After
some searching on the web, I found that the problem was a version
conflict with another library that I had loaded in a previous cell of
the notebook.  Only by restarting the notebook and running it from
the beginning did the libraries load correctly.  A simple solution,
but not an obvious one.
Familiarity with Jupyter Notebooks is also helpful, but can be picked
up as you go.
Memory: I ran the code on a Windows 11 PC with 32 gigabytes of RAM,
under the Windows Subsystem for Linux (WSL).  Running under WSL means
less memory is available than on a bare Linux machine.  The 'free'
command in the WSL bash terminal says I have about 22 GB free.  I had
to reduce the size of one dataset in order to run some of the
programs.
CPU vs. GPU:  Many of the programs run faster with a suitable GPU.  My
computer does not have one, so I resorted to a CPU.  The authors
suggest using an on-line service that provides access to GPUs; I ran
some of the programs using the free Google Colab.  There was a slight
learning curve; for instance, the programs need to be altered to
provide access to data you upload to your Google account.
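One common way to make uploaded data visible to a Colab notebook, and
the sort of alteration I mean, is to mount your Google Drive; the
path below is a placeholder, and the book's programs may expect
something different.

    # inside a Colab notebook: mount Google Drive so that uploaded data
    # files are visible to the running program
    from google.colab import drive
    drive.mount('/content/drive')

    data_dir = "/content/drive/MyDrive/nlp-book-data/"   # placeholder path
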
REFERENCES
Firth, John. 1957.  "A Synopsis of Linguistic Theory, 1930--1955."
Pp. 1--32 in Studies in Linguistic Analysis.  Oxford: Basil Blackwell.
Harris, Zellig. 1954.  "Distributional Structure."  Word 10: 146--162.
ABOUT THE REVIEWER
Dr. Maxwell is a retired researcher in computational morphology and
other computational resources for low-density languages, and in the
evaluation of natural language processing tools, formerly at the
Center for Advanced Study of Language (later named the Applied
Research Laboratory for Intelligence and Security) at the University
of Maryland.  Earlier he did research at the Linguistic Data
Consortium at the University of Pennsylvania, and studied endangered
languages of Ecuador and Colombia with the Summer Institute of
Linguistics.



------------------------------------------------------------------------------

********************** LINGUIST List Support ***********************
Please consider donating to the Linguist List, a U.S. 501(c)(3) not for profit organization:

https://www.paypal.com/donate/?hosted_button_id=87C2AXTVC4PP8

LINGUIST List is supported by the following publishers:

Bloomsbury Publishing http://www.bloomsbury.com/uk/

Cascadilla Press http://www.cascadilla.com/

Language Science Press http://langsci-press.org

MIT Press http://mitpress.mit.edu/

Multilingual Matters http://www.multilingual-matters.com/

Narr Francke Attempto Verlag GmbH + Co. KG http://www.narr.de/

Netherlands Graduate School of Linguistics / Landelijke (LOT) http://www.lotpublications.nl/


----------------------------------------------------------
LINGUIST List: Vol-36-2213
----------------------------------------------------------


