12.1114, Review: Marcu, Discourse Parsing (2nd review)
The LINGUIST Network
linguist at linguistlist.org
Mon Apr 23 20:04:23 UTC 2001
LINGUIST List: Vol-12-1114. Mon Apr 23 2001. ISSN: 1068-4875.
Moderators: Anthony Aristar, Wayne State U. <aristar at linguistlist.org>
Helen Dry, Eastern Michigan U. <hdry at linguistlist.org>
Andrew Carnie, U. of Arizona <carnie at linguistlist.org>
Reviews (reviews at linguistlist.org):
Simin Karimi, U. of Arizona
Terence Langendoen, U. of Arizona
Editors (linguist at linguistlist.org):
Karen Milligan, WSU          Naomi Ogasawara, EMU
Lydia Grebenyova, EMU        Jody Huellmantel, WSU
James Yuells, WSU            Michael Appleby, EMU
Marie Klopfenstein, WSU      Ljuba Veselinova, Stockholm U.
Heather Taylor-Loring, EMU
Software: John Remmers, E. Michigan U. <remmers at emunix.emich.edu>
Gayathri Sriram, E. Michigan U. <gayatri at linguistlist.org>
Home Page: http://linguistlist.org/
The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.
Editor for this issue: Terence Langendoen <terry at linguistlist.org>
==========================================================================
What follows is another discussion note contributed to our Book Discussion
Forum. We expect these discussions to be informal and interactive; and
the author of the book discussed is cordially invited to join in.
If you are interested in leading a book discussion, look for books
announced on LINGUIST as "available for discussion." (This means that
the publisher has sent us a review copy.) Then contact Simin Karimi at
simin at linguistlist.org or Terry Langendoen at terry at linguistlist.org.
=================================Directory=================================
1)
Date: Mon, 23 Apr 2001 15:57:09 -0400 (EDT)
From: David Parkinson <davpark at microsoft.com>
Subject: Review of Marcu, Discourse Parsing & Summarization
-------------------------------- Message 1 -------------------------------
Marcu, Daniel (2000). The Theory and Practice of Discourse
Parsing and Summarization. Cambridge, MA: MIT Press. xix,
248 pp.
David Parkinson, Natural Language Group, Microsoft
Corporation.
Marcu's book (henceforth TPDPS) presents the theoretical
background to and practical results of his investigation of
the automatic derivation of discourse structure in natural
language texts. Working primarily within the theoretical
framework provided by Rhetorical Structure Theory (RST; Mann
& Thompson 1988), and using text summarization as his primary
empirical test-bed, M develops a formal theory
of text structure and proposes two distinct implementations.
This work will be of interest and use both to linguists
concerned with text structure and discourse organization and
to computational linguists or computer scientists
familiar with other problem domains in natural language
processing.
The structure of TPDPS is as follows:
Section I concerns the theoretical background necessary for
the presentation of M's experiments in parsing and
summarization.
Chapter 1 introduces the goals and outline of the book; M
provides an RST-style roadmap of the rhetorical structure of
the book.
Chapter 2 presents some general background common to many
theories of discourse structure, and gives a very brief
overview of RST as M's preferred theory of discourse
structure. M argues that RST is insufficient as a
computational theory, since it lacks strict well-formedness
criteria for its data structures and because it lacks a
provably complete algorithm for deriving all possible
rhetorical structure trees for a given discourse (or
discourse fragment). To begin to alleviate these
shortcomings, M proposes two flavors of well-formedness in
the form of constraints on the relation between spans and
their sub-spans. Once the data structures (discourse
representations in the form of RST trees) are formally
specified, the algorithms follow, as is standard practice.
Chapter 3 develops the formal and combinatorial properties
of the data structures that M will implement, proceeding
from the axiomatization of valid text structures (a
mathematical description of well-formedness), and on to the
proof-theoretic account of the derivation of valid text
structures. This material is potentially slow going for the
less computationally inclined reader, but M does a very good
job of making it clear and relevant to the problems
discussed previously.
Chapter 4 takes the results of the previous chapter and
briefly discusses two approaches to implementing these
results in a computational system aimed at producing
complete sets of well-formed RST data structures for a given
input.
Chapter 5 summarizes the results of the section, and raises
some very interesting questions about the sets of
assumptions underlying the material presented in Chapters 3
and 4. With respect to the general issues surrounding the
definition of well-formed rhetorical structures, M signals
open issues about other types of information that might well
be taken into account in competing implementations. And with
respect to the problem of implementing an efficient
algorithm for constructing the complete set (or some
optimally useful subset) of rhetorical structures, M
discusses other search methods that might be employed while
parsing discourse structure, in order to constrain the vast
numbers of structures that might otherwise be produced in an
unconstrained search of the parse space. Although a deep
investigation of the computationally optimal and
psycholinguistically most plausible algorithms for discourse
parsing lies outside the scope of M's stated intentions, this
is an interesting and provocative chapter, in spite of its
brevity.
Section II presents the two approaches taken by M to parsing
discourse-structure, with discussion devoted to contrasting
these two approaches.
Chapter 6 presents a cue-phrase-based rhetorical parser,
which uses the appearance of relevant discourse-functional
markers in text to indicate both the boundaries of discourse
units and the relations holding between them. Because this
technique depends on the manual
extraction of relevant markers and the way that they are
used to delimit spans of text related by signaled RST
relations, M first presents the results of a corpus
evaluation he performed to determine the characteristics of
more than 450 discourse markers. These were analyzed in 2100
text fragments from the Brown corpus, and collapsed into 54
RST or RST-like relations. M first presents the precision
and recall results of the algorithms for identifying
discourse markers and clauselike units, and then moves on to
the main results concerning the overall accuracy of this
method of segmenting discourse and hypothesizing the
discourse relations holding between its component parts.
Chapter 7 presents a contrasting approach to parsing
discourse structure, using machine-learning approaches to
deduce rhetorical relations from a training corpus of
hand-tagged data. Again, the problem of hypothesizing complete
parses is broken into a segmentation problem and a labeling
problem. In the segmentation phase, the learning algorithm
is sensitized to features such as POS tags within a window 5
tokens wide and punctuation marks. In the labeling phase,
where the aim is to produce well-formed RST trees whose
nodes correctly represent the span, hierarchy, and discourse
relation of each subtree, the learning algorithm is
sensitized to a variety of features, including some that are
more semantically sophisticated (e.g., WordNet-based
similarity of hypothesized spans).
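To make the segmentation features concrete, here is a minimal
sketch in Python. It is an illustration only, not M's
implementation: the function name, the feature names, the
padding symbol, and the punctuation set are all assumptions,
and the review does not say how the 5-token window is
positioned (it is centered on the current token here).

```python
from typing import Dict, List, Tuple

# Assumed punctuation inventory; the book's actual set may differ.
PUNCTUATION = {".", ",", ";", ":", "?", "!", "(", ")", "--"}


def window_features(
    tagged: List[Tuple[str, str]], i: int, width: int = 5
) -> Dict[str, object]:
    """Build features for token i from a window of `width` tokens.

    `tagged` is a list of (word, POS tag) pairs. Positions that
    fall outside the sentence are padded with a dummy tag.
    """
    half = width // 2
    feats: Dict[str, object] = {}
    for offset in range(-half, half + 1):
        j = i + offset
        if 0 <= j < len(tagged):
            word, tag = tagged[j]
        else:
            word, tag = "<pad>", "PAD"
        feats[f"pos[{offset:+d}]"] = tag
        feats[f"punct[{offset:+d}]"] = word in PUNCTUATION
    return feats


# Hypothetical tagged sentence; token 3 is the comma.
sentence = [
    ("Although", "IN"), ("it", "PRP"), ("rained", "VBD"), (",", ","),
    ("we", "PRP"), ("left", "VBD"), (".", "."),
]
features = window_features(sentence, 3)
```

A learner trained on such feature dictionaries would then
predict, for each token, whether a unit boundary occurs there.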
Chapter 8 provides an overview of previous empirical
research on discourse parsing, and again concludes with a
brief but useful discussion of some open issues, especially
additional information that could be used to inform and
improve discourse parsing in future.
Section III is dedicated to the application of the
computational approaches developed in Section II to a real-
world problem: the summarization of text by extraction of
the most relevant units.
Chapter 9 presents the results of an experiment in which M
contrasts the results obtained for extract-directed
summarization by (i) human judges asked to assign importance
scores to text units of varying degrees of importance; (ii)
human analysts who hand-parsed the texts according to RST
rhetorical relations; (iii) the cue-phrase-based rhetorical
parser discussed in Chapter 6. These results are contrasted
against three baselines: (iv) the Microsoft Office 97
summarizer; (v) selection of the first N important units in
the text; and (vi) random selection of N units in the text
(where N equals the number of units that human judges chose
as important). Overall, the results indicate that the cue-
phrase-based rhetorical parser, despite its weaknesses in
recognition of the full set of rhetorical relations in a
given text, comes close to human performance when compared
against the results obtained by the human analysts.
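The shape of such a comparison can be sketched as follows,
assuming (the review does not spell this out) that each
method returns a set of unit indices and is scored by
precision and recall against the units the human judges
chose. The function names and the toy data are hypothetical,
and the first-N baseline here simply takes the first N units
of the text.

```python
import random
from typing import Set, Tuple


def precision_recall(selected: Set[int], gold: Set[int]) -> Tuple[float, float]:
    """Score a set of extracted units against the judges' choices."""
    hits = len(selected & gold)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


def lead_baseline(num_units: int, n: int) -> Set[int]:
    """Select the first n units of the text."""
    return set(range(min(n, num_units)))


def random_baseline(num_units: int, n: int, seed: int = 0) -> Set[int]:
    """Select n units uniformly at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return set(rng.sample(range(num_units), min(n, num_units)))


# Toy example: a 10-unit text in which judges marked units 0, 3, 7.
gold = {0, 3, 7}
system = {0, 3, 5}  # hypothetical output of a rhetorical parser
n = len(gold)

system_pr = precision_recall(system, gold)
lead_pr = precision_recall(lead_baseline(10, n), gold)
random_pr = precision_recall(random_baseline(10, n), gold)
```

Setting N to the number of units the judges selected, as M
does, makes precision and recall coincide for the two
selection baselines, since both sets then have the same size.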
In Chapter 10, M takes the results obtained in the
summarization task and shows how performance can be boosted
by taking into account a variety of heuristic measures aimed
at driving down ambiguity of parses produced. The approach
is a sensible one: when more than one well-formed rhetorical
structure is produced for some text, an efficient system is
one which is able to successfully choose between competing
parses and assign a higher likelihood to some subset. Among
the metrics that M employs are: the presence of explicit
discourse markers, rightward skew to trees produced, and
lexical similarity to the title. Again, some of these
metrics may be particularly relevant to the summarization
task; still others might be found to be useful. But M is
more concerned with laying out the general approach than
conclusively determining the "correct" set of heuristics,
which seems like the right approach to take in a book such
as this. The chapter concludes with an algorithm designed to
find the optimal weighting of the seven heuristics used by M,
and discussion of the improvements obtained over the untuned
rhetorical parser.
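The general idea of ranking competing parses by weighted
heuristics can be sketched as below. This is a sketch under
stated assumptions, not M's algorithm: the review does not
say how the heuristics are combined, so a simple weighted sum
is assumed, and the heuristic names, scores, and weights are
invented for illustration.

```python
from typing import Dict, List


def combined_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of heuristic scores (assumed combination rule)."""
    return sum(w * scores.get(name, 0.0) for name, w in weights.items())


def best_parse(candidates: List[Dict[str, float]], weights: Dict[str, float]) -> int:
    """Return the index of the highest-scoring candidate parse."""
    return max(
        range(len(candidates)),
        key=lambda k: combined_score(candidates[k], weights),
    )


# Hypothetical weights over three of the heuristics named above.
weights = {"markers": 0.5, "right_skew": 0.2, "title_similarity": 0.3}

# Hypothetical heuristic scores for two competing parses of one text.
candidates = [
    {"markers": 0.4, "right_skew": 0.9, "title_similarity": 0.1},
    {"markers": 0.8, "right_skew": 0.3, "title_similarity": 0.5},
]

chosen = best_parse(candidates, weights)
```

Tuning would then amount to searching for the weights that
give the best summarization results on held-out texts.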
Chapter 11 summarizes the results obtained in Chapters 9 and
10, and concludes with some future directions, as well as
issuing some promissory notes for the usefulness of
rhetorical parsing in other problem domains, such as natural
language generation, machine translation, and information
retrieval.
The brevity and general succinctness of TPDPS are a bit of a
two-edged sword: M does an admirable job of presenting the
linguistic background and theoretical assumptions of RST
from a very high level perspective, and of developing the
formal properties of the data structures and algorithms he
uses. Still, the reviewer feels that the former may not
supply quite enough information to convince the more
computationally-oriented reader that RST is the best
theoretical foundation; and the interest of the more
linguistically-oriented reader may flag somewhat during the
chapters devoted to formal proof of the soundness of the
computational mechanisms employed. These are minor
quibbles, however, and Marcu is a kind enough author to understand
and accommodate the varying needs of the audience he hopes
to attract. Above all, he provides pointers throughout the
book to open issues, alternatives, possibilities left
uninvestigated, and future directions -- which is only fair
in a field (rhetorical parsing) in its infancy. This is
overall a very useful and highly readable introduction to a
synthesis of theoretical and computational approaches to
discourse structure, suitable for use by anyone from
undergraduate through researcher.
References:
Mann, William C. & Sandra A. Thompson. 1988. Rhetorical
structure theory: Toward a functional theory of text
organization. Text 8(3): 243-281.
David Parkinson is a computational syntactician in the
Natural Language Group at the Microsoft Corporation.
---------------------------------------------------------------------------
If you buy this book please tell the publisher or author
that you saw it reviewed on the LINGUIST list.
---------------------------------------------------------------------------