27.2005, Review: Computational Ling; Phonology; Syntax; Text/Corpus Ling: van Zaanen, Heinz, de la Higuera (2015)



LINGUIST List: Vol-27-2005. Mon May 02 2016. ISSN: 1069 - 4875.

Subject: 27.2005, Review: Computational Ling; Phonology; Syntax; Text/Corpus Ling: van Zaanen, Heinz, de la Higuera (2015)

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry, Robert Coté, Sara Couture)
Homepage: http://linguistlist.org

*****************    LINGUIST List Support    *****************
                       Fund Drive 2016
                   25 years of LINGUIST List!
Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Sara  Couture <sara at linguistlist.org>
================================================================


Date: Mon, 02 May 2016 12:01:56
From: Andrew Lamont [alamont at indiana.edu]
Subject: Grammatical Inference for Computational Linguistics

 
Discuss this message:
http://linguistlist.org/pubs/reviews/get-review.cfm?subid=36139077


Book announced at http://linguistlist.org/issues/26/26-4897.html

AUTHOR: Jeffrey N. Heinz
AUTHOR: Colin  de la Higuera
AUTHOR: Menno M. van Zaanen
TITLE: Grammatical Inference for Computational Linguistics
SERIES TITLE: Synthesis Lectures on Human Language Technologies
PUBLISHER: Morgan & Claypool Publishers
YEAR: 2015

REVIEWER: Andrew Lamont, Indiana University

Co-reviewed by Andrew Lamont and Lwin Moe

Reviews Editor: Helen Aristar-Dry

SUMMARY

Grammatical inference is the task of learning the structure of a formally
defined system. Early on, the authors state that the “purpose of this book is
to introduce computational linguists to the major results of this field and to
its way of thinking” (2). That is, the book surveys what grammatical inference
is and how it is done without assuming its audience has a background in formal
language theory. Chapter 1 presents an overview of grammar and inference,
Chapter 2 describes formal learning, Chapters 3 and 4 discuss empirical
learning of regular and non-regular languages, respectively, and Chapter 5
summarizes and enumerates open questions.

Chapter 1 presents a definition of grammar as it is used throughout the book
and outlines the task of inference. The structure of the chapter parallels
that of the rest of the book. Inference is broken down into formal and
empirical inference: the discussion of formal inference includes an overview
of formal language theory as it relates to natural language, while the
discussion of empirical inference summarizes how grammatical inference systems
are evaluated. The chapter ends with a summary and a section that formally
defines important concepts, ranging from alphabets and strings to the Chomsky
Hierarchy.
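
To make those basic notions concrete, here is a small Python sketch of our own
(it is not taken from the book, and all names in it are purely illustrative)
showing an alphabet, a finite slice of the strings over it, and a language
defined as a subset of those strings:

from itertools import product

# Illustrative alphabet Sigma = {a, b}.
alphabet = ("a", "b")

def strings_up_to(n):
    """Yield every string over the alphabet of length <= n
    (a finite slice of Sigma*)."""
    for length in range(n + 1):
        for combo in product(alphabet, repeat=length):
            yield "".join(combo)

# A simple regular language over this alphabet: some a's followed by some b's.
language = [w for w in strings_up_to(3)
            if w == "a" * w.count("a") + "b" * w.count("b")]
print(language)
# -> ['', 'a', 'b', 'aa', 'ab', 'bb', 'aaa', 'aab', 'abb', 'bbb']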

With grammar defined in Chapter 1, Chapter 2 addresses what inference looks
like. The chapter begins by describing the relationship between the learner,
i.e. the system tasked with inferring the grammar, and the oracle, i.e. the
source of the grammar, and how much information the learner has access to at
any given time. The chapter then defines fundamental concepts of learning and
presents the field’s major results relating to them. The bulk of the chapter
describes grammar formalisms relevant to natural language: finite-state
automata, Hidden Markov models, context-free grammars, and others. The chapter
ends by examining how grammatical inference relates to machine learning in
general, followed by a brief summary.
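
As a rough illustration of the learner/oracle setup (our own sketch, not the
book’s code; the class name DFA and its methods are invented for this
example), the following Python fragment encodes a small finite-state automaton
as an oracle that answers the membership queries a learner might pose:

class DFA:
    """A deterministic finite-state automaton used here as an oracle."""
    def __init__(self, transitions, start, accepting):
        self.transitions = transitions  # dict: (state, symbol) -> state
        self.start = start
        self.accepting = accepting      # set of accepting states

    def accepts(self, string):
        """Membership query: is `string` in the language of this DFA?"""
        state = self.start
        for symbol in string:
            if (state, symbol) not in self.transitions:
                return False
            state = self.transitions[(state, symbol)]
        return state in self.accepting

# Target language: strings over {a, b} with an even number of b's.
oracle = DFA(
    transitions={(0, "a"): 0, (0, "b"): 1, (1, "a"): 1, (1, "b"): 0},
    start=0,
    accepting={0},
)

# A learner with access to a membership oracle can probe it directly:
for query in ["", "ab", "abb", "bab"]:
    print(repr(query), oracle.accepts(query))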

Chapter 3 covers the empirical learning of regular languages. Regular
languages are separated from non-regular languages in the text, paralleling
the split between phonology and syntax for linguists. The chapter justifies
this by characterizing the formal split in terms of the learning bias for the
class of regular languages. The chapter defines regular grammars and
state-merging, the main algorithm discussed, and grounds these abstract
concepts with examples drawn from Pintupi and Kwakwala stress patterns. The
application of state-merging to the inference of regular grammars, and its
relation to linguistic typology, is justified as a learning bias. The chapter
also covers a learning scenario in which negative evidence is available, as
well as the learning of transducers. The last section before the summary
covers learning stochastic languages, a task familiar to most computational
linguists as n-gram modeling. However, this section highlights regular
patterns such as sibilant harmony in Samala, which are not string-local and
therefore require an alternate inference strategy.
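
To give a flavor of the state-merging approach (this is our own sketch under
simplifying assumptions, not the algorithm as the book presents it; the
function name and the toy “stress” strings are invented), the following Python
fragment builds the prefix tree acceptor from which state-merging learners
typically start:

def prefix_tree_acceptor(positive_examples):
    """Build a tree-shaped DFA that accepts exactly the positive examples.
    Returns (transitions, accepting_states); state 0 is the start state."""
    transitions = {}   # (state, symbol) -> state
    accepting = set()
    next_state = 1
    for word in positive_examples:
        state = 0
        for symbol in word:
            if (state, symbol) not in transitions:
                transitions[(state, symbol)] = next_state
                next_state += 1
            state = transitions[(state, symbol)]
        accepting.add(state)
    return transitions, accepting

# Toy data: "S" = stressed syllable, "s" = unstressed syllable
# (hypothetical strings, loosely in the spirit of the Pintupi example).
data = [["S"], ["S", "s"], ["S", "s", "S", "s"]]
transitions, accepting = prefix_tree_acceptor(data)
print(len(transitions), "transitions;", len(accepting), "accepting states")

# A state-merging learner would now generalize by merging states of this tree
# whenever the merge keeps the automaton consistent with the data.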

Chapter 4 examines the learning of non-regular languages, specifically
languages modeled by context-free grammars, such as much of natural language
syntax. The chapter begins by discussing substitutability, the principle
underlying the learning algorithms given in the chapter, as exemplified by
constituent structure in natural language syntax. Substitutability-based
learning is broken into expanding and reducing the hypothesis grammar, which
relate to state-merging and state-splitting from Chapter 3. The chapter then
concisely surveys several inference systems, detailing their learning
processes and grouping the systems by whether they take an expanding or a
reducing approach and by the input they assume. Several approaches to
evaluating the output of inference are then presented, along with a discussion
of the advantages and disadvantages of each. Finally, before summarizing, the
chapter discusses whether context-free grammars are powerful enough to capture
natural language syntax and, if not, the difficulty of learning
context-sensitive grammars.
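
As a rough sketch of the substitutability idea (ours, not the book’s
algorithm; the toy corpus and the function names are invented), the Python
fragment below treats two substrings as weakly substitutable when they share
at least one context in the corpus, the kind of signal the expanding and
reducing learners exploit when positing nonterminals:

def contexts(corpus, substring):
    """Collect the (left, right) contexts in which `substring` occurs."""
    found = set()
    n = len(substring)
    for sentence in corpus:
        for i in range(len(sentence) - n + 1):
            if sentence[i:i + n] == substring:
                found.add((tuple(sentence[:i]), tuple(sentence[i + n:])))
    return found

def substitutable(corpus, a, b):
    """Weak test: do `a` and `b` share at least one context in the corpus?"""
    return bool(contexts(corpus, a) & contexts(corpus, b))

# Toy corpus of tokenized sentences (invented).
corpus = [
    ["the", "dog", "sleeps"],
    ["the", "cat", "sleeps"],
    ["a", "cat", "purrs"],
]
print(substitutable(corpus, ["dog"], ["cat"]))     # True: shared context
print(substitutable(corpus, ["dog"], ["sleeps"]))  # False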

Chapter 5 provides a concise summary of the book, discusses the various ways
to approach and understand the problem of learning, and then overviews current
open areas of research along with resources for further study.

EVALUATION

Overall, we feel this book provides a strong introduction to grammatical
inference, is well written for computational linguists, and as such
accomplishes exactly what it sets out to do. The book is self-contained,
introducing and defining terms and concepts that readers need to be familiar
with in order to understand the larger points. The book is approachably short,
just over 120 dense pages with references; a determined reader should find it
takes at most a weekend to read. As such, it makes a good read cover-to-cover.
The book is also organized very clearly, laying out its subsections and
formalisms visually. For example, the grammar formalisms in Chapter 2 are all
accompanied by example figures and tables giving information on the time
complexity of parsing, the formalisms’ expressive power, and their use and
implementation in learning. This organization makes the book a tremendously
useful reference tool both for readers who have finished the text and want to
return quickly to major points and for those browsing it. Further, the text
makes extensive references both to source materials and to other summaries of
the field. We both come from a computational linguistics background with only
minimal experience in grammatical inference (primarily n-gram modeling). The
book gave a thorough and understandable overview of the field and is a very
useful resource to return to in the future.

The presentation of theoretical concepts is at times fast and may be
overwhelming to readers who do not have a strong background in formal language
theory or its applications, e.g. finite-state automata. However, a slower pace
would risk recasting the book as a dry, several-hundred-page textbook. The
presentation of theoretical concepts is always accompanied by a task-based
justification in the text, so the reader never feels they are working through
formalisms for no reason. For example, in the first chapter, before the
initial presentation of formal language theory, the authors devote half a page
to motivating the utility of such a formalism for discussing grammatical
inference and to justifying the importance of formally defining the task and
its aspects. Another strength in the presentation of abstract material is that
it is consistently tied to the task at hand. For example, the discussion of
grammar formalisms in Chapter 2 is accompanied in the text by a summary of
each formalism’s ability to parse, model, and learn. This keeps the abstract
material grounded in the task of learning. While the theoretical material may
at times be cumbersome, its presentation is well motivated and anchored in the
concrete task at hand.

Another strength of the book is that it keeps its audience in mind. Just as
the concrete task is embedded within the more abstract discussion,
applications in linguistics are brought up throughout the book. The
presentation of learning regular grammars in Chapter 3 is given with the
example of stress patterns in Pintupi and Kwakwala. This grounds the formal
material very nicely for computational linguists who are more comfortable with
data. The dissimilar stress patterns provide solid, concrete examples for
finite-state automata and prefix/suffix trees. The sibilant harmony pattern in
Samala likewise provides a concrete application of stochastic strictly k-local
languages, which may otherwise be opaque. Chapter 4 similarly delves into
syntactic topics ranging from constituency tests to treebanks and dependency
parsing. Throughout the text, references to phonology, machine learning,
machine translation, cognitive science, and other potential applications of
grammatical inference are made, constantly reasserting the relation to
computational linguistics. While regular and non-regular languages are split
in Chapters 3 and 4, the link between the approaches given therein is
expressed clearly at the outset of Chapter 3. This establishes that
researchers primarily interested in syntax (a non-regular system) should
nonetheless engage with the material on inferring regular systems. All in all,
the book reasserts its relation to specific areas of linguistics as well as to
linguistics as a whole.


ABOUT THE REVIEWER

Andrew Lamont has a master’s degree in Linguistics from Eastern Michigan
University and is currently finishing a master’s degree in Computational
Linguistics from Indiana University. His research primarily concerns
phonological typology and computational aspects of Optimality Theory.

Lwin Moe has a master’s degree in Computer Science from the Asian Institute of
Technology in Bangkok and a master’s degree in Computational Linguistics from
Indiana
University. His research interests include computational issues in Southeast
Asian languages.





------------------------------------------------------------------------------



----------------------------------------------------------
LINGUIST List: Vol-27-2005	
----------------------------------------------------------






