23.799, Review: Computational Linguistics: Smith (2011)

linguist at LINGUISTLIST.ORG
Fri Feb 17 03:33:46 UTC 2012


LINGUIST List: Vol-23-799. Thu Feb 16 2012. ISSN: 1069 - 4875.

Subject: 23.799, Review: Computational Linguistics: Smith (2011)

Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews: Veronika Drake, U of Wisconsin-Madison
Monica Macaulay, U of Wisconsin-Madison
Rajiv Rao, U of Wisconsin-Madison
Joseph Salmons, U of Wisconsin-Madison
Anja Wanner, U of Wisconsin-Madison
       <reviews at linguistlist.org>

Homepage: http://linguistlist.org

The LINGUIST List is funded by Eastern Michigan University,
and donations from subscribers and publishers.

Editor for this issue: Monica Macaulay <monica at linguistlist.org>
================================================================  

This LINGUIST List issue is a review of a book published by one of our
supporting publishers, commissioned by our book review editorial staff. We
welcome discussion of this book review on the list, and particularly invite
the author(s) or editor(s) of this book to join in. If you are interested in 
reviewing a book for LINGUIST, look for the most recent posting with the subject 
"Reviews: AVAILABLE FOR REVIEW", and follow the instructions at the top of the 
message. You can also contact the book review staff directly.

===========================Directory==============================  

1)
Date: 16-Feb-2012
From: Benjamin Börschinger [benjamin.boerschinger at googlemail.com]
Subject: Linguistic Structure Prediction


-------------------------Message 1 ---------------------------------- 
Date: Thu, 16 Feb 2012 22:32:17
From: Benjamin Börschinger [benjamin.boerschinger at googlemail.com]
Subject: Linguistic Structure Prediction


Announced at http://linguistlist.org/issues/22/22-3195.html 

AUTHOR: Smith, Noah A. 
TITLE: Linguistic Structure Prediction 
PUBLISHER: Morgan & Claypool Publishers 
YEAR: 2011

Benjamin Börschinger, Department of Computing, Macquarie University

SUMMARY

In this book, Noah Smith sets out to ''bridge the gap between NLP [Natural
Language Processing] and ML [Machine Learning]'' (p. xv). It is targeted at
''graduate students, NLP researchers wanting to better understand ML, and ML
researchers wanting to better understand NLP,'' and as such, assumes a ''[b]asic
familiarity'' (p. xv) with at least one of the areas.

Smith proposes the general framework of 'linguistic structure prediction' as an
approach to NLP problems that makes clear the connections to ML. Within this
framework, Smith discusses a wide variety of techniques and approaches to both
supervised and unsupervised learning in considerable technical detail. He
covers important 'traditional' ideas such as the semiring approach in NLP,
by now 'standard' approaches such as Support Vector Machines (SVMs) and
Maximum Entropy models, and the general ideas behind, and the differences
between, generative and discriminative models.

Chapter one introduces some typical NLP tasks, ranging from Part-of-Speech
tagging to syntactic parsing. While providing a basic overview of NLP problems
for people with a primary background in ML, Smith also shows how to formulate
them as instances of a generic ML problem, highlighting the connections between
the two fields. He proposes the abstract ''linguistic structure prediction framework''
that is made up of four steps:

First, ''[f]ormally define the inputs and outputs'' (p. 20). Smith provides
concrete examples of how to do this for tasks such as Part-of-Speech tagging
and syntactic parsing, thereby also introducing, in an intuitive way, notation
that is used consistently throughout the remainder of the book. Second,
''[i]dentify a
scoring function over input-output pairs, and an algorithm that can find the
maximum-scoring output given an input'' (p. 20), the topic of Chapter two. Third,
identify potential data that actually allows one ''to learn to predict outputs
from inputs, and apply a learning algorithm'' (p. 20), discussed further in
Chapters three to five. Finally, ''evaluate the model on an objective criterion
measured on unseen test data'' (p. 21), a point that is addressed in one of the
four appendices.

Chapter two starts with an explicit definition of the scoring function mentioned
in step two. A scoring function needs to assign a single real number (a score)
to every possible input-output pair <I,O> for the given task; for example, in
Part-of-Speech (PoS) tagging it needs to assign a score to every possible pair
of word and PoS sequences of the same length. Given a formal representation of
both input and output along the lines of Chapter one, these representations are
first mapped to a feature vector, an ordered sequence of real numbers. As in
Chapter one, Smith provides concrete and intuitive examples of how to define a
feature function for specific NLP tasks, giving people with a primarily ML
background an idea of what relevant features can look like in NLP. Given a
feature vector V, a score is then simply assigned to the corresponding
input-output pair <I,O> by taking the dot product of the vector V and a
given weight vector W. Thus, problems in NLP can be stated as standard
optimization problems: for a given input I, find the output O such that the
dot product of the corresponding feature vector V and the weight vector W is
as large as possible. Obviously, this raises two further questions, namely how one can
efficiently find the optimal output value and where the weight vector W comes
from. The rest of Chapter two is dedicated to the first problem, finding the
optimal output if one already knows the weight vector, for which Smith
introduces the technical term 'decoding'.
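
To make this formulation concrete, here is a minimal Python sketch of linear
scoring and brute-force decoding for PoS tagging; the feature templates,
weights, and toy sentence are my own illustrative assumptions, not examples
taken from the book:

    from itertools import product

    def features(words, tags):
        # Map an input-output pair <I,O> to a sparse feature vector,
        # represented as a dict from feature names to counts.
        f = {}
        prev = "<s>"
        for word, tag in zip(words, tags):
            f[("word-tag", word, tag)] = f.get(("word-tag", word, tag), 0) + 1
            f[("tag-bigram", prev, tag)] = f.get(("tag-bigram", prev, tag), 0) + 1
            prev = tag
        return f

    def score(weights, words, tags):
        # The score of <I,O> is the dot product of its feature vector V
        # with the weight vector W.
        return sum(weights.get(k, 0.0) * v
                   for k, v in features(words, tags).items())

    def decode(weights, words, tagset):
        # Brute-force decoding: enumerate all |tagset|^len(words) outputs
        # and return the maximum-scoring one.
        return max(product(tagset, repeat=len(words)),
                   key=lambda tags: score(weights, words, tags))

    weights = {("word-tag", "dog", "NOUN"): 1.0,
               ("word-tag", "barks", "VERB"): 1.0,
               ("tag-bigram", "NOUN", "VERB"): 0.5}
    print(decode(weights, ["dog", "barks"], ["NOUN", "VERB"]))

The enumeration is exponential in sentence length; the decoding algorithms
discussed in the rest of the chapter exist precisely to avoid it.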

Before addressing how the problem of decoding can efficiently be solved,
however, Smith introduces his 'five views on decoding' that serve to provide
more intuitive ways to think about the rather abstract formulation of the
scoring function as just a product of vectors. In particular, Smith shows how
apparently different approaches such as Probabilistic Graphical Models, Parsing
or Weighted Logic Programs can all be seen as (not necessarily equivalent)
instantiations of the abstract idea of decoding just outlined, and that certain
tasks are more naturally framed in one view than in another.

The rest of Chapter two provides an exhaustive discussion of Dynamic
Programming, one of the key ideas behind many efficient algorithms that have a
long history in NLP. Even for people familiar with specific instances of Dynamic
Programming, e.g. the CKY or the Viterbi algorithm, Smith's presentation is
interesting in that he uses the mathematical idea of a semiring to provide a
very general look at Dynamic Programming. His use of important NLP algorithms as
examples again makes it easy for people already familiar with them to understand
the abstract approach, and provides people without a strong background in NLP
with a clear understanding of some of the core algorithms of the field.
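
To give a flavor of the semiring idea, the following sketch (my own, with
invented toy HMM parameters) runs one and the same chart recursion under two
different semirings: the probability semiring (+, x) yields the forward
algorithm's total probability, while the Viterbi semiring (max, x) yields the
probability of the single best tag sequence:

    def hmm_chart(words, tags, start, trans, emit, plus, times):
        # One generic HMM chart recursion, parameterized by the semiring
        # operations `plus` (aggregation over competing analyses) and
        # `times` (combination of local scores).
        chart = {t: times(start[t], emit[t][words[0]]) for t in tags}
        for w in words[1:]:
            chart = {t: plus(times(chart[s], times(trans[s][t], emit[t][w]))
                             for s in tags)
                     for t in tags}
        return plus(chart.values())

    tags = ["N", "V"]
    start = {"N": 0.7, "V": 0.3}
    trans = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
    emit = {"N": {"dog": 0.9, "barks": 0.1}, "V": {"dog": 0.2, "barks": 0.8}}
    mul = lambda a, b: a * b

    # Probability semiring (+, x): the forward algorithm's total probability.
    total = hmm_chart(["dog", "barks"], tags, start, trans, emit, sum, mul)
    # Viterbi semiring (max, x): the score of the best single tag sequence.
    best = hmm_chart(["dog", "barks"], tags, start, trans, emit, max, mul)

Recovering the best sequence itself would additionally require back-pointers
(or a suitably richer semiring), but the point stands: changing the semiring
changes the quantity computed, not the algorithm.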

Finally, Chapter two includes a quick overview of approximate algorithms that
can be used to solve problems for which no efficient Dynamic Programming
formulation is known, or in cases where solving the DP exactly is still
impractical, a situation that often arises in practice.
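
Smith's overview is general; as one standard concrete instance (my choice of
illustration, not necessarily his), beam search bounds the number of partial
hypotheses kept at each position, where `local_score` is an assumed per-step
scoring function:

    def beam_decode(words, tagset, local_score, beam_size=4):
        # Approximate decoding: keep only the `beam_size` highest-scoring
        # partial tag sequences at each position instead of all of them.
        beam = [((), 0.0)]  # (partial tag sequence, accumulated score)
        for i in range(len(words)):
            expanded = [(tags + (t,), s + local_score(words, i, tags, t))
                        for tags, s in beam for t in tagset]
            beam = sorted(expanded, key=lambda h: h[1],
                          reverse=True)[:beam_size]
        return beam[0]  # best hypothesis found, not necessarily optimal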

Chapter three addresses the problem of finding a weight vector W, assuming many
correct input-output pairs are available. This situation is known as supervised
learning, and the first half of Chapter three provides a detailed discussion of
how probabilistic models, both generative and conditional, can be used in this
setting. Smith succeeds in making clear the difference between generative and
conditional probabilistic models and their relative merits and shortcomings, as
well as in relating them back to the abstract notion of decoding developed in
Chapter two. He starts with basic and ''traditional'' approaches such as Hidden
Markov Models, Probabilistic Context Free Grammars and Naïve Bayes, again
providing an opportunity to become comfortable, in a 'familiar' setting, with
the intricate notation and the far-from-trivial mathematical details that
necessarily come up in this context. It is also worth pointing out that Smith
provides an introduction to the use of Dirichlet Priors as an elegant approach
to smoothing, an idea that has become increasingly popular in the NLP
community in recent years.
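
As a concrete illustration of the latter point (a sketch under my own
assumptions, not code from the book): with a symmetric Dirichlet(alpha) prior
over a categorical distribution, such as the emission distribution of an HMM
state, the posterior mean estimate is simply the add-alpha smoothed relative
frequency, so unseen events keep nonzero probability:

    from collections import Counter

    def add_alpha_estimate(observations, vocabulary, alpha=0.5):
        # Posterior mean under a symmetric Dirichlet(alpha) prior:
        # (count + alpha) / (N + alpha * |V|).
        counts = Counter(observations)
        denom = len(observations) + alpha * len(vocabulary)
        return {w: (counts[w] + alpha) / denom for w in vocabulary}

    probs = add_alpha_estimate(["dog", "dog", "barks"],
                               ["dog", "barks", "cat"])
    # The unseen word "cat" still receives nonzero probability mass.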

The second half of Chapter three discusses fully discriminative models, in
particular large margin methods, and Smith succeeds in explaining their general
difference from probabilistic models and why one may choose to use one or the
other. Starting with the very intuitive Perceptron algorithm, Smith goes on to
sketch the more powerful framework of Support Vector Machines, pointing out the
challenges these models pose in learning weight vectors from the training data.
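
The intuitiveness of the perceptron is easy to convey in code. The following
sketch of a structured perceptron reuses the hypothetical `features` and
`decode` functions from the earlier sketch; it is my own illustration, not
code from the book:

    def perceptron_train(data, tagset, weights, epochs=5):
        # Decode with the current weights; on a mistake, move the weights
        # toward the gold analysis's features and away from the predicted
        # analysis's features.
        for _ in range(epochs):
            for words, gold in data:
                predicted = decode(weights, words, tagset)
                if tuple(predicted) != tuple(gold):
                    for k, v in features(words, gold).items():
                        weights[k] = weights.get(k, 0.0) + v
                    for k, v in features(words, predicted).items():
                        weights[k] = weights.get(k, 0.0) - v
        return weights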

Chapter four addresses the problem of finding a weight vector W in cases where
one does not have any input-output pairs to learn from, commonly called
unsupervised learning. It provides an excellent discussion of the Expectation
Maximization algorithm, and how it can be applied to probabilistic generative
models for unsupervised learning. Particularly noteworthy is the discussion of
Bayesian Unsupervised Learning, a very recent trend in NLP. Smith both makes
very clear the general difference between the Bayesian and the 'standard'
(frequentist) approach and provides an explanation of the relevant formal
background, including a quick discussion of Markov Chain Monte Carlo methods and
Variational Inference.
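
As a toy illustration of the EM idea (my own example; Smith's exposition is
considerably more general), consider estimating the biases of two coins when
it is unknown which coin produced each observed sequence of flips:

    def em_two_coins(sequences, theta=(0.6, 0.4), iterations=20):
        # Each sequence of flips (1 = heads) came from one of two coins of
        # unknown bias; mixing weights are fixed at 1/2 for simplicity.
        for _ in range(iterations):
            heads, totals = [0.0, 0.0], [0.0, 0.0]
            for seq in sequences:
                h, n = sum(seq), len(seq)
                # E-step: posterior responsibility of each coin for `seq`.
                like = [t ** h * (1 - t) ** (n - h) for t in theta]
                r0 = like[0] / (like[0] + like[1])
                for k, r in enumerate((r0, 1 - r0)):
                    heads[k] += r * h
                    totals[k] += r * n
            # M-step: re-estimate each bias from the expected counts.
            theta = (heads[0] / totals[0], heads[1] / totals[1])
        return theta

    print(em_two_coins([[1, 1, 1, 0], [1, 1, 1, 1],
                        [0, 0, 1, 0], [0, 0, 0, 1]]))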

In addition to unsupervised learning, Chapter four provides a short but
insightful discussion of Hidden Variable Learning. This is the case where one
actually has input-output pairs to learn from, as in Chapter three, but assumes
that there is unknown hidden structure on the input side that is useful in
predicting the output. Importantly, Hidden Variable Learning can also be applied
to conditional and discriminative models whereas 'classical' unsupervised
learning cannot, thus providing a connection to the discussions in Chapter three.

Chapter five provides additional mathematical details about some of the
subproblems that arose in Chapters three and four. It is the most
mathematically challenging part of the book, but it can be used as a reference
for specific mathematical questions that arise when actually implementing one
of the techniques discussed in the previous chapters.

Finally, there are four appendices that provide additional background on some of
the topics mentioned in the main text. The first appendix discusses numerical
optimization methods, particularly important for Chapters three and four. The
second appendix discusses experimental methodology in NLP, thus addressing the
important question of how to do evaluation. The third and fourth appendices
elaborate on Chapters three and four, discussing Maximum Entropy models and the
difference between locally and globally normalized probabilistic models.

EVALUATION

''Linguistic Structure Prediction'' provides an exhaustive and detailed discussion
of ML techniques that are currently used within NLP. It is clearly not an
introductory book for people with absolutely no background in either field, but
given that it defines its own target audience as graduate students and
researchers, this is not a real criticism. The only thing that I would consider
'missing' is a discussion of directed probabilistic graphical models, as Smith
focuses exclusively on undirected ones. Again, however, this is something that
Smith points out himself, and his reason for focusing on the undirected case,
''general methods for decoding are [...] easier to understand in the undirected
case'' (p. 29), is reasonable.

Particularly noteworthy is how up-to-date the book is. Even very recent
developments are covered, e.g. Bayesian non-parametric approaches, and the
extensive bibliography makes it easy for the reader to identify relevant and
recent papers. Researchers in NLP will welcome it as a useful and up-to-date
reference book for ML techniques, in particular given that most of the
literature in ML uses examples that are often hard to connect back to the
discrete and richly structured world of NLP problems.

To conclude, ''Linguistic Structure Prediction'' does a good job of bringing
ML and NLP closer together.

ABOUT THE REVIEWER 

Benjamin Börschinger received his M.A. in Philosophy and Computational
Linguistics from the University of Heidelberg and is currently working
towards his PhD in Computational Linguistics at Macquarie University,
Australia. His research focuses on computational models of human language
acquisition and unsupervised learning of linguistic structure.
Additionally, he is interested in conceptual questions that arise at the
interface of Linguistics and other disciplines, in particular the Cognitive
Sciences. 





-----------------------------------------------------------
LINGUIST List: Vol-23-799	
----------------------------------------------------------


