16.1349, Review: Computational Ling: Kübler (2004)

Thu Apr 28 03:28:55 UTC 2005

LINGUIST List: Vol-16-1349. Wed Apr 27 2005. ISSN: 1068 - 4875.

Subject: 16.1349, Review: Computational Ling: Kübler (2004)

Moderators: Anthony Aristar, Wayne State U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews (reviews at linguistlist.org) 
        Sheila Dooley, U of Arizona  
        Terry Langendoen, U of Arizona  

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: Naomi Ogasawara <naomi at linguistlist.org>
================================================================  

What follows is a review or discussion note contributed to our 
Book Discussion Forum. We expect discussions to be informal and 
interactive; and the author of the book discussed is cordially 
invited to join in. If you are interested in leading a book 
discussion, look for books announced on LINGUIST as "available 
for review." Then contact Sheila Dooley at collberg at linguistlist.org. 

===========================Directory==============================  

1)
Date: 27-Apr-2005
From: Svetoslav Marinov < svetoslav.marinov at his.se >
Subject: Memory-Based Parsing 

-------------------------Message 1 ---------------------------------- 
Date: Wed, 27 Apr 2005 23:23:52
From: Svetoslav Marinov < svetoslav.marinov at his.se >
Subject: Memory-Based Parsing 

AUTHOR: Kübler, Sandra
TITLE: Memory-Based Parsing
SERIES: Natural Language Processing 7
YEAR: 2004
PUBLISHER: John Benjamins 
Announced at http://linguistlist.org/issues/15/15-3368.html

Svetoslav Marinov, Gothenburg University/University College 
Skoevde, Sweden

SYNOPSIS

The book consists of 8 chapters plus 2 appendices.

The first chapter (Introduction) sets up the stage for the two central 
topics in this monograph -- parsing and machine-learning. The notion 
of syntactic parsing is explained and the basic distinction between 
data-driven and discrete approaches to parsing is given. Relying on 
the former to create a broad coverage parser for German, the author 
has chosen machine learning (ML) techniques and specifically the 
paradigm of memory-based learning (MBL), which is an example of 
supervised methods within ML. Brief introduction to MBL is given and 
the place of ML in Natural Language Processing (NLP) is also 
mentioned. 

Chapter 2 gives the reader a more extensive introduction to MBL. 
Section 2.1 covers the basic approach and the crucial idea of lazy 
learning - i.e. "storing the presented training instances without 
abstraction" (p. 11), contrary to eager learning, as well as the 
similarity metrics used to calculate the distance of a new instance to 
be classified to the already stored ones. The algorithms extending the 
basic model in order to achieve faster retrieval, reduction of storage 
requirements and tolerance to noise are given in section 2.2 together 
with results of their performance. Feature selection is important in any 
k-nearest neighbour approach within ML. It is therefore necessary to 
carefully select the relevant ones and have proper metrics for 
calculating their weights. This is clearly explained in section 2.3, 
where the author discusses two of the most widely used - mutual 
information and class projection, relying mainly, but not only, on the 
work of (Wettschereck et al, 1997).

The next chapter deals with the application of MBL to syntactic 
parsing. It is a careful review of the existing albeit not a very extensive 
literature. The difficulty in utilising MBL for parsing stems from the fact 
that "[n]atural language parsing [...] is not a task that can easily be 
defined [as a classification problem]" (p. 34). While the majority of 
approaches described in this part split the task into sub-tasks in order 
to better mold parsing into the classification paradigm, Sandra 
Kübler's method is different, as it will become clear later on in the 
book. Chapter 3, therefore, describes how different authors solve the 
partial tasks or adopt the idea of MBL by cascaded classifiers. Section 
3.1 summarises two approaches to noun-phrase (NP) chunking. Next, 
shallow parsing and grammatical relation finding are discussed. 
Memory-based sequence learning, an approach in which the order of 
the words is very important, is presented with its results for NP 
chunking and subject-verb, and verb-object relation assignment. 
Certain emphasis is also given to the work of Sabine Buchholz (e.g. 
(Buchholz, 2002)), especially the optimal results and parameters 
involved in discovering relations between verb chunks and other 
chunks in a sentence. Last, in section 3.3, an approach to full parsing 
using MBL is reviewed. 

Data-oriented parsing (DOP) is the topic of chapter 4. While not an 
embodiment of the k-nearest neighbour model, DOP shares many 
similarities with MBL, which Kübler summarises on p. 58. The basic 
DOP model, i.e. DOP1, for phrase structure trees is presented in 
section 4.1. The author then goes on to describe the two important 
stages in DOP - the actual parsing of new input and the 
disambiguation procedures. The former being a generation of a 
parsed forest, while the latter is the finding of the most probable 
derivation (an NP-hard problem, unlike in standard probabilistic 
context-free parsing). Two evaluations of DOP1 are discussed in 
section 4.2 - one on the Wall Street Journal section of the Penn 
Treebank (Marcus et al., 1993) and the other on the Air Travel 
Information System corpus. In the next section the non-probabilistic 
DOP is mentioned and the comparison to its stochastic counterpart is 
shown. Certain enhancements of DOP to tackle unknown words (the 
so called DOP3 model) are presented in section 4.4. Further, the 
possibility to render DOP to PCFG in order to achieve faster search 
for the most probable derivation and alleviate the exponential growth 
of grammar, as well as a memory-based approach to DOP are cited in 
the remaining two sections.

Chapter 5 is central to the book since it presents the memory-based 
parser TueSBL and the treebank it is trained on TueBa-D/S. TueBa-
D/S is a corpus of transliterated spontaneous dialogues developed in 
the context of VERBMOBIL (speech-to-speech machine translation 
project). Therefore it covers the domain of business appointments, 
travel scheduling and hotel reservations and contains approximately 
67000 trees. In quite some detail the annotation scheme is presented 
with appropriate examples and a comparison to the Penn Treebank is 
given in section 5.1. In the following section Kübler gives reasons for 
not employing the cascaded approach to parsing in the MBL 
paradigm. Complete trees are thus viewed as classification classes, a 
decision reminiscent to DOP. To quickly batter away doubts on 
the 'trees as classes'-concept, the hybrid parsing architecture of 
TueSBL is presented. Its indispensable preprocessing module CASS, 
Abney's (1991) chunk parser and its adoption to German, and the 
present task, are described in section 5.3. The learning component in 
the MBL paradigm requires two things in order to succeed - lots of 
data and carefully selected relevant features used for the 
classification task. Since syntactically annotated corpora is after all of 
modest size, the same goes for TueBa-D/S, TueSBL is given access 
to all possible information in the treebank, yet not simultaneously 
but "ordered according to their reliability" (p. 160). These design 
decisions, the organisation of the instance base, as well as the search 
in it are clearly explained in section 5.5. A sample parse and the 
weighting scheme are then presented. The latter is also an example of 
non-standard approach. The chapter ends with an explanation of the 
backing-off strategy of TueSBL, which is triggered in those case 
where no tree structures were discovered for input chunks or no 
sentences were matched to the input sentence. 

Chapter 6 is devoted to the evaluation of TueSBL. The standard 
PARSEVAL metrics of bracketed/labelled precision and recall are 
described and TueSBL's performance in these terms is given. Since 
TueBa-D/S contains functional labels (e.g. HD for head, described in 
Chapter 5), "functional"-labelled precision and recall are also 
calculated. Additionally, certain modules of the parser are evaluated 
separately and the results are given in section 6.4. Analysis of 
TueSBL's errors is then presented. Since it is debatable whether 
PARCEVAL's measures offer an accurate picture of a parser's output, 
a dependency-based evaluation is proposed (cf. (Lin, 1995)). Section 
6.6 also discussed the conversion of TueBa-D/S into dependency-
based representation. Section 6.7 not only gives the results of the 
dependency-based evaluation, but it also offers a comparison of the 
two metrics. 

Bringing us back to the wider concept of MBL, Chapter 7 compares 
Kübler's parser to other approaches already mentions in Chapters 2-
4. TueSBL resembles DOP, since both methods "rely on trees or tree 
fragments that go beyond the local information present in rules." [p. 
252]. Yet memory requirements and levels of generalisation are quite 
different in the two approaches. TueSBL strives to return a complete 
parse while other MBL methods based on cascaded classifiers focus 
on chunk and grammatical relation finding (section 7.2). Section 7.3 
gives a comparison to a conceptually different full memory-based 
parser.

Chapter 8 concludes the book and brings up the fact that the 
presented approach "is especially tailored towards processing 
spontaneous speech." [p. 262]. Appendix A is the Stuttgart-Tuebingen 
Tagset. Appendix B gives the syntactic categories and functional 
labels of TueBa-D/S.

CRITICAL EVALUATION

The book by Sandra Kübler is an important contribution to the area of 
syntactic parsing in several respects. First, this is the monograph's 
main point - a memory-based robust parser for German spontaneous 
speech. A data-driven approach to NLP in its incarnation as an MBL is 
used for the design of a parser (TueSBL) whose architecture 
deserves to be looked at by anyone interested in parsing spoken input 
or using analogy-based methods in Computational Linguistics, or 
parsing German. 

TueSBL is designed to exploit a relatively small treebank to its 
maximum when trained on it. Complete trees are stored in the memory 
and also represent the classification classes. This is something that 
resembles DOP very much yet one do not get the weak points of pure 
DOP - large storage space and intractability of the parsing model. 
TueSBL is also different than the standard MBL approaches. It does 
not use cascades of MBL classifiers and does not only return 
grammatical relations between chunks but delivers a complete parse. 
On the other hand it is not a purely k-nearest neighbour approach. As 
such certain modifications are necessary. The notion of classification 
classes becomes a very relaxed and loose idea. These have internal 
structure and allow for modifications, such as omission of words and 
chunks from a tree (see pages 127-132). This particular decision is 
very well motivated in the book and also stems from the design and 
organisation of the treebank (TueBa-D/S). With 3 levels of annotation -
 phrasal, topological and functional (section 5.2), compared to the flat 
structure of Penn Treebank; with the particularities of spoken input - 
omissions, false starts, corrections, etc; with the free word-order of 
German (while most MBL approaches use English and the Penn 
Treebank), TueBa-D/S would offer an impossible task to the majority 
of standard MBL methods using cascaded classifier. The weighting 
metrics are also non-standard and this hinges again on the design 
decisions of TueSBL. 

Another strong point of the monograph is that the work described in it 
is clearly placed in the context of other memory-based approaches to 
parsing. Chapters 3 and 4 give in enough detail to what previous 
authors have done in this field. Chapter 2 is also essential to better 
understand TueBSL. The reader is given enough detail to understand 
the essence of memory-based learning, yet not too much to be 
distracted from the central topic - TueSBL. Recently, another way to 
exploit the benefits of MBL for syntactic parsing has been presented, 
namely the work of (Nivre and Sholz, 2004) and (Nivre et al., 2004). 
There, TiMBL (Tilburg Memory-Based Learner, (Daelemans et al., 
2003)) is used to predict the next step of a deterministic dependency 
based parser. (Nivre et al., 2004) present results for Swedish and in 
(Nivre and Scholz, 2004) the results for English are given.

Finally, the question remains how scalable Kübler's approach is. While 
the author claims that TueSBL can be trained on any treebank that 
conforms to the TueBa-D/S format, it is well known that treebanks are 
time- and resource-consuming. It would have been interesting to know 
whether TueSBL could be trained on other treebanks (not strictly 
following the TueBa-D/S format) or even on non-phrase structure 
treebanks, i.e. dependency-based representations. 

REFERENCES

Abney, Steven. 1991. Parsing By Chunks. In: Robert Berwick, Steven 
Abney and Carol Tenny (eds.), Principle-Based Parsing. Kluwer 
Academic Publishers, Dordrecht. 1991. 

Buchholz, Sabine. 2002. Memory-Based Grammatical Relation 
Finding. PhD Thesis. University of Tilburg, The Netherlands.

Daelemans, Walter, Jakub Zavrel, Ko van der Sloot and Antal van den 
Bosch. 2003. 'TiMBL: Tilburg Memory Based Learner, version 5.0, 
Reference Guide.' ILK Research Group Technical Report Series no. 
03-10, 56 pages. 

Lin, Dekang. 1995. A Dependency-based Method for Evaluating 
Broad-Coverage Parsers. In Proceedings of the 14th International 
Joint Conference on Artificial Intelligence, IJCAI-95. Montreal, Canada.

Marcus, Mitchell, Beatrice Santorini and Mary Ann Marcinkiewicz. 
1993. Building a Large Annotated Corpus of English: The Penn 
Treebank. Computational Linguistics 19:2.313-330

Nivre, Joakim, Johan Hall and Jens Nilsson. 2004. Memory-Based 
Dependency Parsing. In Ng, H. T. and Riloff, E. (eds.) Proceedings of 
the Eighth Conference on Computational Natural Language Learning 
(CoNLL). May 6-7, 2004. Boston, Massachusetts. pp. 49-56. 

Nivre, Joakim and Mario Scholz. 2004. Deterministic Dependency 
Parsing of English Text. In Proceedings of COLING 2004, Geneva, 
Switzerland, August 23-27, 2004. 

Wettschereck, Dietrich, David W. Aha and Takao Mohri. 1997. A 
Review and Empirical Evaluation of Feature Weighting Methods for a 
Class of Lazy Learning Algorithms. Artificial Intelligence Review 11(1-
5): 273-314. 

ABOUT THE REVIEWER

I am a PhD student in Computational Linguistics at Gothenburg 
University as part of GSLT (Swedish Graduate School of Language 
Technology). I am currently working on a thesis about data-driven 
approaches to syntactic parsing of Bulgarian.

-----------------------------------------------------------
LINGUIST List: Vol-16-1349