
LINGUIST List: Vol-30-4380. Mon Nov 18 2019. ISSN: 1069 - 4875.

Subject: 30.4380, Review: Computational Linguistics; Text/Corpus Linguistics: Cohen (2019)

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Jeremy Coburn
Managing Editor: Becca Morris
Team: Helen Aristar-Dry, Everett Green, Sarah Robinson, Peace Han, Nils Hjortnaes, Yiwen Zhang, Julian Dietrich
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Jeremy Coburn <jecoburn at linguistlist.org>
================================================================


Date: Mon, 18 Nov 2019 13:17:15
From: Brett Drury [brett.drury at gmail.com]
Subject: Bayesian Analysis in Natural Language Processing

 
Discuss this message:
http://linguistlist.org/pubs/reviews/get-review.cfm?subid=36530717


Book announced at http://linguistlist.org/issues/30/30-1843.html

AUTHOR: Shay Cohen
TITLE: Bayesian Analysis in Natural Language Processing
SUBTITLE: Second Edition
SERIES TITLE: Synthesis Lectures on Human Language Technologies edited by Graeme Hirst
PUBLISHER: Morgan & Claypool Publishers
YEAR: 2019

REVIEWER: Brett Mylo Drury

SUMMARY

Introduction

Probabilistic reasoning and analysis form a popular subfield of machine
learning applied to Natural Language Processing (NLP). One branch of
probability, Bayesian Statistics, offers distinctive techniques that NLP
practitioners and academic researchers can exploit. This timely publication by
Shay Cohen provides an in-depth explanation of Bayesian Analysis applied to
NLP.

Chapters

This review is structured around the chapters of the book. The first is
Preliminaries, an explanation of basic Bayesian statistics and related
principles. The areas covered in the chapter include random variables and
conditional distributions. This chapter is not for newcomers to the field and
should be treated as a refresher for researchers already familiar with
Bayesian Statistics, because Cohen addresses fundamental principles at
breathtaking speed. For example, Directed Acyclic Graphs (DAGs) are dispatched
in two pages. To his credit, Cohen directs the reader to standard texts in the
area, but the admission that Bayesian Networks are not addressed in depth is a
pity, because Bayesian Networks and their associated augmented Naive Bayes
classifiers have a role to play in the Bayesian analysis of natural language.

The Introduction chapter is the second of eight chapters, and in it Cohen
seeks to differentiate language exploration, which is the secondary theme of
this book, from NLP. Cohen claims that language exploration deals with the
understanding of language, whereas NLP ''learns and perform inferences with
data''. This definition is not authoritative, and NLP practitioners and
researchers who work in fields such as Natural Language Understanding (NLU)
will disagree with it. It seems that the statement was made to justify the
remaining content of the book. Although the chapter is called Introduction, a
more accurate title would have been Latent Dirichlet Allocation (LDA), because
LDA is the dominant theme of the chapter. Except for a small number of pages
which introduce and justify Bayesian methods in NLP, the remainder of the
chapter is a dedicated explanation and demonstration of LDA. The chapter
provides the theoretical justification of LDA and its advantages over the
bag-of-words representation. It is debatable whether this justification still
holds, because the majority of the NLP community has moved to word vectors.
Although the advantages of using LDA have faded with the introduction of word
vectors and language models, the examples provided by Cohen give a good
illustration of principles used in Bayesian statistics that will be relied
upon later in the book.
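
To make the flavour of these examples concrete, the following short Python
sketch (my own toy illustration, not code from the book) fits a two-topic LDA
model with scikit-learn; the corpus and parameter values are invented, and
get_feature_names_out assumes a recent scikit-learn release.

  # Toy LDA example (reviewer's illustration, not code from the book).
  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.decomposition import LatentDirichletAllocation

  docs = [
      "the bank approved the loan",
      "the river bank flooded after the rain",
      "interest rates and loan approvals",
      "rain swelled the river overnight",
  ]

  vectorizer = CountVectorizer()
  X = vectorizer.fit_transform(docs)                 # bag-of-words counts
  lda = LatentDirichletAllocation(n_components=2, random_state=0)
  doc_topics = lda.fit_transform(X)                  # per-document topic mixtures

  vocab = vectorizer.get_feature_names_out()
  for k, weights in enumerate(lda.components_):
      top_words = [vocab[i] for i in weights.argsort()[-4:]]
      print("topic", k, ":", top_words)
  print(doc_topics.round(2))                         # documents as topic proportions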

Priors are fundamental to Bayesian statistics, and Priors is the name of the
third chapter. Priors are ''distributions over a set of hypotheses''; they are
essentially pre-existing beliefs about a domain or problem. The chapter covers
the following: conjugate priors, priors over multinomial and categorical
distributions, non-informative priors, and conjugacy and exponential models.
The explanation of each of the described priors is clear, and each section
ends with a summary. This is a comprehensive treatment of priors, and Cohen
pays more attention to this area than do similar books such as Barber's
Bayesian Reasoning and Machine Learning.
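
As a concrete illustration of conjugacy (my own toy Python example, not taken
from the book): with a Dirichlet prior over a categorical word distribution,
the posterior after observing counts is again a Dirichlet whose parameters are
the prior pseudo-counts plus the observed counts.

  # Dirichlet-Multinomial conjugacy (reviewer's toy example).
  import numpy as np

  vocab = ["cat", "dog", "fish"]
  alpha = np.array([1.0, 1.0, 1.0])     # symmetric Dirichlet prior pseudo-counts
  counts = np.array([7, 2, 1])          # observed word counts

  posterior_alpha = alpha + counts      # posterior is Dirichlet(alpha + counts)
  posterior_mean = posterior_alpha / posterior_alpha.sum()

  for word, p in zip(vocab, posterior_mean):
      print(word, round(float(p), 3))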

Chapter Four is concerned with Bayesian Estimation. Bayesian estimation is the
bedrock on which Bayesian Inference is based. Bayesian Inference, with which
many NLP practitioners are directly or indirectly familiar, infers a
''posterior distribution from data'' using the ''parameters of a model''. The
principal aim of Bayesian estimation, according to Cohen, is to summarise the
posterior distribution rather than capture it in full. The ultimate aim of the
technique is to provide interpretable inferences about a specific problem or
domain. Cohen provides some example problems, such as syntactic tree
generation and sentence alignment, which can be addressed using Bayesian
estimation.

The main areas that this chapter covers are:

1. Learning with Latent Variables
2. Bayesian Point Estimation
3. Empirical Bayes 
4. Asymptotic Behaviour of the Posterior

Learning with Latent Variables is an opinion section by the author in which he
defines the two main methods of inference from data: using all the observed
data, or splitting the data into training and test sets. Cohen characterises
these strategies as analogous to unsupervised and supervised learning. He
concentrates on the second scenario, claiming that Bayesian Point Estimation
provides a suitable compromise between the computationally expensive fully
Bayesian approach and the need for efficient, lightweight models that make
inferences on unseen data points. This justification acts as a segue into the
Bayesian Point Estimation part of the chapter.

Point estimation is a technique from statistics where a single value is
computed from a data sample. This value acts as the best estimate of a given
parameter, and it is obtained using a point estimator. The focus of a number
of Bayesian point estimators is the central tendency of the posterior
distribution, which can be estimated using quantities such as the posterior
mean and median. As stated in the previous section, Bayesian estimators can be
used in the inference process.
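
The following Python fragment (my own sketch, with a stand-in posterior rather
than anything from the book) shows the two point estimates mentioned above
computed from posterior samples.

  # Posterior mean and median as Bayesian point estimates (reviewer's sketch).
  import numpy as np

  rng = np.random.default_rng(0)
  posterior_samples = rng.beta(8, 4, size=10_000)   # stand-in posterior samples

  print("posterior mean:  ", posterior_samples.mean())
  print("posterior median:", np.median(posterior_samples))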

Cohen states the goal of Bayesian Point Estimation as ''summarising the
posterior over the parameters into a fixed set of parameters'', and he links
this goal to a frequentist approach known as maximum likelihood estimation.
Cohen presents Bayesian maximum a posteriori estimation (MAP) as a suitable
technique. The remainder of the section describes the mathematical principles
of MAP as well as its adherence to the Minimum Message Length principle, which
is an encapsulation of Occam's Razor. The section also includes sub-sections
on smoothing (default probabilities for words that are absent from the sample
data) and regularization, as well as the computation of MAP with latent
variables. The section finishes with decision-theoretic point estimation,
which introduces the notion of Bayes Risk and the idea of a loss function.
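
To connect MAP estimation with smoothing, here is a small Python illustration
of my own (the standard textbook result, not the book's notation): with a
symmetric Dirichlet(alpha) prior over word probabilities, the MAP estimate
adds alpha - 1 pseudo-counts to each word, and alpha = 2 recovers add-one
(Laplace) smoothing, so unseen words receive non-zero probability.

  # MAP estimation with a symmetric Dirichlet prior (reviewer's illustration).
  import numpy as np

  counts = np.array([5.0, 3.0, 0.0, 0.0])   # the last two words are unseen
  alpha = 2.0                               # symmetric Dirichlet hyperparameter
  K, N = len(counts), counts.sum()

  mle = counts / N                                        # maximum likelihood
  map_est = (counts + alpha - 1) / (N + K * (alpha - 1))  # MAP estimate

  print("MLE:", mle)        # zeros for unseen words
  print("MAP:", map_est)    # smoothed; every word gets non-zero probability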

The remaining sections are quite short compared with the one on point
estimation. Empirical Bayes discusses a method of encoding information into
hyperparameters, and the final section briefly discusses the asymptotic
behaviour of the posterior when the data are sampled from a distribution to
which the model does not belong.

Sampling Methods is the next chapter. It describes some sampling techniques
that are used to estimate the posterior. These techniques are needed when the
posterior distribution cannot be represented analytically or computed
efficiently. Sampling draws samples from the underlying distribution, and from
these samples the distribution of interest can be inferred.

The chapter concentrates upon the Monte Carlo family of sampling techniques,
and in particular Markov Chain Monte Carlo (MCMC). As part of this focus, the
chapter starts with an overview of the family of MCMC techniques.

The sampling technique for Bayesian statistics that most people are familiar
with is Gibbs sampling, and a large chunk of the chapter is devoted to this
technique. It also discusses a variant of the technique - Collapsed Gibbs
Sampling. In both cases, a comprehensive mathematical treatment is given, as
well as the differences between the two techniques. Cohen highlights the
drawback of Gibbs sampling, which is that it can be computationally expensive;
consequently, he describes a method for parallelising the technique across
multiple processors. The chapter also considers non-Gibbs MCMC
techniques such as Metropolis-Hastings, Slice Sampling and Simulated
Annealing. The chapter concludes with a discussion about the convergence of
MCMC Algorithms and some theory about Markov Chains as well as a brief
discussion about alternatives to MCMC sampling techniques.
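
To give a feel for what a Gibbs sampler does, here is a minimal Python sketch
(my own toy example, not one of the book's NLP models): it samples from a
correlated bivariate normal by alternately drawing each coordinate from its
conditional distribution given the other.

  # Gibbs sampling for a correlated bivariate normal (reviewer's toy example).
  import numpy as np

  rng = np.random.default_rng(0)
  rho = 0.8
  x, y = 0.0, 0.0
  samples = []

  for step in range(20_000):
      x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))   # draw x | y
      y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))   # draw y | x
      if step >= 1_000:                                # discard burn-in samples
          samples.append((x, y))

  samples = np.array(samples)
  print("empirical correlation:", np.corrcoef(samples.T)[0, 1])   # close to 0.8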

The variational inference chapter considers an alternative approach to
approximate inference. Variational inference treats the problem of estimating
the posterior as an ''optimisation problem''. Cohen states that variational
inference borrows concepts from the branch of mathematics concerned with the
minimisation and maximisation of functionals. The chapter is broken down into
the following subsections:

1. Variational Bound on Marginal Log-Likelihood
2. Mean-Field Approximation
3. Mean-Field Variational Inference 
4. Dirichlet-Multinomial Variational Inference
5. Connection to the Expectation-Maximization Algorithm
6. Empirical Bayes With Variational Inference

The chapter concludes with a discussion of its contents and the main points
covered.

The variational bound subsection steps through some calculations for what
Cohen describes as a typical scenario. Cohen describes the Mean-Field
approximation as a technique that specifies an approximate posterior family
with a factorized form. He also states that, in common with Gibbs Sampling,
the technique requires a partition of the latent (random) variables, and he
goes on to describe the factorized form over these variables. The Mean-Field
Variational Inference subsection describes the algorithm typically used with
the mean-field approximation, providing pseudocode as well as an explanation
of each phase of the algorithm. The Dirichlet-Multinomial Variational
Inference subsection describes the application of Mean-Field Variational
Inference to Dirichlet-Multinomial models. The Connection to the
Expectation-Maximization Algorithm subsection describes the connection between
the Mean-Field Variational Inference algorithm and the Expectation-Maximization
algorithm. And finally, the chapter describes the variational algorithm in an
Empirical Bayes setting.
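
The variational bound itself can be checked numerically in a few lines of
Python. The sketch below (my own toy model, not the book's) shows, for a
single discrete latent variable, that the ELBO equals the marginal
log-likelihood minus the KL divergence from the variational distribution to
the true posterior, and so never exceeds the marginal log-likelihood.

  # Numeric check of the variational bound (reviewer's toy model).
  import numpy as np

  p_z = np.array([0.6, 0.4])            # prior p(z) over a binary latent variable
  p_x_given_z = np.array([0.2, 0.7])    # likelihood p(x | z) for the observed x
  p_xz = p_z * p_x_given_z              # joint p(x, z)
  log_px = np.log(p_xz.sum())           # exact marginal log-likelihood log p(x)

  q = np.array([0.5, 0.5])              # an arbitrary variational distribution q(z)
  elbo = np.sum(q * (np.log(p_xz) - np.log(q)))
  kl = np.sum(q * (np.log(q) - np.log(p_xz / p_xz.sum())))

  print(log_px, elbo, elbo + kl)        # elbo <= log_px, and elbo + kl == log_px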

Nonparametric Priors is the next chapter, in which Cohen motivates the use of
nonparametric Bayesian modelling by providing an example of mapping clusters
to cluster-specific distributional properties of a word drawn from the
vocabulary under study. He states that there are two issues with this type of
arrangement: it may generate too few clusters, which will not capture a sample
large enough to represent the majority of the clusters, or it may generate too
many clusters, which will also capture the noise in the document collection.
Cohen states that a way to represent this arrangement is to use nonparametric
Bayesian modelling with a nonparametric prior. A nonparametric prior, Cohen
reminds us, is ''a set of random variables indexed by an infinite, linearly
ordered set''. Cohen then provides an example, the Dirichlet process, which
uses a nonparametric prior to define a distribution over a set of
distributions. The chapter continues with the Dirichlet process and provides
various views of the process, including the stick-breaking and Chinese
restaurant constructions. The chapter also provides a discussion of Dirichlet
process mixture models (DPMMs), which are a ''generalisation of the finite
mixture model''. The discussion is mainly based around inference with DPMMs,
which includes Markov Chain Monte Carlo (MCMC) and Variational Inference. The
chapter ends with the Hierarchical Dirichlet Process and the Pitman-Yor
Process as well as a discussion.
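
The stick-breaking view is easy to simulate. The Python sketch below (my own
illustration, with a truncation level chosen for convenience rather than
anything from the book) draws the mixture weights of a truncated Dirichlet
process by repeatedly breaking off beta-distributed fractions of a unit stick.

  # Stick-breaking construction of a truncated Dirichlet process
  # (reviewer's illustration).
  import numpy as np

  rng = np.random.default_rng(0)
  alpha = 1.0      # concentration parameter
  T = 20           # truncation level

  betas = rng.beta(1.0, alpha, size=T)                   # stick-breaking fractions
  remaining = np.concatenate(([1.0], np.cumprod(1 - betas)[:-1]))
  weights = betas * remaining                            # mixture weights

  atoms = rng.normal(0.0, 5.0, size=T)                   # atoms drawn from a base G0
  print(weights.round(3), weights.sum())                 # weights sum to almost 1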

The next chapter in the book is Bayesian Grammar Models, which Cohen claims is
one of the most successful applications of Bayesian strategies to NLP. This
chapter is mainly focused on probabilistic context-free grammars. Cohen
provides some justifications for this approach, including that context-free
grammars are relatively simple and that the research literature is relatively
complete. The first approach addressed is Hidden Markov Models (HMMs), which
Cohen claims are a special form of context-free grammars. Cohen provides a
short description of HMMs as well as their mathematical formalisation. The
chapter then addresses probabilistic context-free grammars. The author
describes the link between phrase-structure trees and context-free grammars,
and provides the mathematical formalisation of PCFGs as well as a discussion
of their inference algorithms. The remainder of the chapter covers Bayesian
context-free grammars, adaptor grammars, HDP-PCFGs, synchronous grammars and
multilingual learning. Each of these sections, as well as the chapter in
general, builds upon the concepts described earlier in the book.
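
For readers who have not met PCFGs in practice, the following Python sketch
(my own toy grammar, not an example from the book; it assumes the NLTK
package) defines a tiny probabilistic context-free grammar and recovers the
most probable parse of a sentence.

  # A tiny probabilistic context-free grammar with NLTK (reviewer's toy grammar).
  import nltk

  grammar = nltk.PCFG.fromstring("""
      S   -> NP VP    [1.0]
      NP  -> Det N    [0.7] | 'Mary' [0.3]
      VP  -> V NP     [1.0]
      Det -> 'the'    [1.0]
      N   -> 'dog'    [0.6] | 'ball' [0.4]
      V   -> 'saw'    [1.0]
  """)

  parser = nltk.ViterbiParser(grammar)          # finds the most probable parse
  for tree in parser.parse("Mary saw the dog".split()):
      print(tree)                               # the parse tree
      print(tree.prob())                        # and its probability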

The final chapter is Representational Learning and Neural Networks. Cohen
starts the chapter by describing the rise of representational learning, the
conditions required for representational techniques, and their applications.
The chapter describes Neural Networks because they are a form of
representational learning. The first part of the chapter describes the history
of Neural Networks and why they have become popular now. The second part of
the chapter covers word embeddings. Word embeddings are a form of vector
representation of words based on their co-occurrences with other words. As
most practitioners and researchers know, word vectors can be generated by
word2vec, and as this is a Bayesian book, a technique based upon a Bayesian
version of word2vec is described. A large amount of the chapter describes
modern-day Neural Networks, their training techniques and activation
functions, as well as the use of word embeddings with Neural Networks. There
is a brief discussion of Neural Networks that can remember across time steps,
such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Units. The
chapter ends with a discussion of tuning Neural Networks as well as Generative
Modelling with Neural Networks.
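
As a reference point for readers unfamiliar with word2vec, here is a minimal
Python sketch using gensim 4.x (my own toy corpus, not an example from the
book; older gensim versions call the vector_size parameter size).

  # Training toy word2vec embeddings with gensim 4.x (reviewer's example).
  from gensim.models import Word2Vec

  sentences = [
      ["the", "cat", "sat", "on", "the", "mat"],
      ["the", "dog", "sat", "on", "the", "rug"],
      ["a", "cat", "chased", "a", "dog"],
  ]

  model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)
  print(model.wv["cat"][:5])                     # first few embedding dimensions
  print(model.wv.most_similar("cat", topn=2))    # nearest neighbours in vector space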

Conclusion

It is difficult to place the intended audience for this book. It is
information-dense but a little short on introductory material, and therefore
it is not suitable for beginners or novices to the field. I found myself
jumping to reference books and rereading sections to understand the author's
point. There is also a significant amount of material omitted that I had hoped
the author might cover, such as Bayesian Networks. Additionally, the Neural
Network chapter felt a little dated and forced to meet the Bayesian remit of
the book. Large language models such as BERT have replaced word vectors for
most practitioners. It was also surprising that there was no mention of
Bayesian Neural Networks. On the plus side, this book dramatically improved my
theoretical understanding of several areas of Bayesian analysis. If you are
working in the area and already have a strong grasp of the fundamentals, then
this book may be useful. If you are a novice, then there are books you need to
read before this one.


ABOUT THE REVIEWER

Brett is a Senior Data Scientist based in Porto, Portugal who works for Skim
Technologies. He has a PhD from the University of Porto. His current research
interests are causal and logical inference from information in text. He can be
contacted at brett at skim.it





------------------------------------------------------------------------------



----------------------------------------------------------
LINGUIST List: Vol-30-4380	
----------------------------------------------------------





