11.86, Review: Kornai: Extended Finite State Models of Lang.
LINGUIST Network
linguist at linguistlist.org
Tue Jan 18 03:34:45 UTC 2000
LINGUIST List: Vol-11-86. Mon Jan 17 2000. ISSN: 1068-4875.
Subject: 11.86, Review: Kornai: Extended Finite State Models of Lang.
Moderators: Anthony Rodrigues Aristar: Wayne State U.<aristar at linguistlist.org>
Helen Dry: Eastern Michigan U. <hdry at linguistlist.org>
Andrew Carnie: U. of Arizona <carnie at linguistlist.org>
Reviews: Andrew Carnie: U. of Arizona <carnie at linguistlist.org>
Associate Editors: Martin Jacobsen <marty at linguistlist.org>
Ljuba Veselinova <ljuba at linguistlist.org>
Scott Fults <scott at linguistlist.org>
Jody Huellmantel <jody at linguistlist.org>
Karen Milligan <karen at linguistlist.org>
Assistant Editors: Lydia Grebenyova <lydia at linguistlist.org>
Naomi Ogasawara <naomi at linguistlist.org>
James Yuells <james at linguistlist.org>
Software development: John H. Remmers <remmers at emunix.emich.edu>
Sudheendra Adiga <sudhi at linguistlist.org>
Qian Liao <qian at linguistlist.org>
Home Page: http://linguistlist.org/
Editor for this issue: Andrew Carnie <carnie at linguistlist.org>
==========================================================================
What follows is another discussion note contributed to our Book Discussion
Forum. We expect these discussions to be informal and interactive; and
the author of the book discussed is cordially invited to join in.
If you are interested in leading a book discussion, look for books
announced on LINGUIST as "available for discussion." (This means that
the publisher has sent us a review copy.) Then contact Andrew Carnie at
carnie at linguistlist.org
=================================Directory=================================
1)
Date: Sun, 16 Jan 2000 12:24:12 +0100 (MET)
From: Alberto Lavelli <lavelli at itc.it>
Subject: Review of "Extended Finite State Models of Language"
-------------------------------- Message 1 -------------------------------
Date: Sun, 16 Jan 2000 12:24:12 +0100 (MET)
From: Alberto Lavelli <lavelli at itc.it>
Subject: Review of "Extended Finite State Models of Language"
Andras Kornai (ed.), 1999, Extended Finite State Models of Language,
Cambridge University Press, pages 278+xii (plus a CD-ROM).
Reviewed by Alberto Lavelli, ITC-IRST, Trento (Italy)
I'll do a frequent use of the following acronyms:
- ECAI: European Conference on Artificial Inteligence
- NLE: the journal Natural Language Engineering
- NLP: Natural Language Processing
- POS: part of speech
SYNOPSIS
The book appears in the ACL Studies in Natural Language Processing
series and originates from a workshop held in Budapest in 1996 during
ECAI'96. In a special issue of the journal Natural Language
Engineering (vol. 2 n. 4, December 1996) a set of articles partially
overlapping with those present in this book appeared (some of the NLE
papers are only abstracts of 2 or 3 pages). The electronic versions
of the papers presented at the ECAI'96 workshop are included in the
CD-ROM accompanying the book. Now I briefly describe the papers
contained in the book. The papers that appeared also in the ECAI
workshop proceedings are marked with "ECAI" near the name of the
authors. Note that sometimes the ECAI'96 versions are considerably
shorter than those in the book.
1. Extended finite state models of language by Andras Kornai
This is a general introduction with a brief presentation of the
papers contained in the book and of the contents of the CD-ROM
2. A parser from antiquity: an early application of finite state
transducers to natural language parsing by Aravind K. Joshi and
Philip Hopely (ECAI)
This paper describes a parser based on a cascade of finite state
transducers developed at the University of Pennsylvania in 1958.
The parser is remarkably modern when compared to some of the recent
work on finite state transducers. A faithful reconstruction of the
parser is available on the CD-ROM.
3. Comments on Joshi and Hopely by Lauri Karttunen (ECAI)
It presents some brief remarks by Karttunen who underlines the
modernity of the parser described in the previous chapter by Joshi
and Hopely.
4. Implementing and using finite automata toolkits by Bruce W. Watson
(ECAI)
It describes a toolkit (FIRE Lite) developed by the author while at
the Eindhoven University of Technology and now freely available for
non-commercial use. The toolkit is available in the accompanying
CD-ROM and also on the Web at www.RibbitSoft.com (note however that
at the beginning of January 2000 I have repeatedly and
unsuccessfully tried to connect to www.RibbitSoft.com). RibbitSoft
distributes also a commercial version of the toolkit (FIRE Engine
II). Both the commercial and the non-commercial toolkit implement
algorithms for building automata from regular expressions and for
minimizing deterministic finite automata.
5. Finite state morphology and formal verification by Manuel Vilares
Ferro, Jorge Grana Gil and Pilar Alvarino Alvarino
It presents the use of verification methods to ease maintenance
during the development of resources for morphological analyzers.
Examples and experiments on Spanish are presented.
6. The Japanese lexical transducer based on stem-suffix style forms by
Masakazu Tateno, Hiroshi Masuichi and Hiroshi Umemoto (ECAI)
It describes a method for building a lexical transducer for
Japanese with stems and suffixes stored separately in different
lexicons; an extra level of automata relates canonical citation
forms and stem-suffix style forms.
7. Acquiring rules for reducing morphological ambiguity from POS
tagged corpus in Korean by Jae-Hoon Kim and Byung-Gyu Jang
It presents a method for reducing morphological ambiguities when
performing morphological analysis of Korean texts. The method
automatically infers rules (called subsumption conditions) from a
POS tagged corpus. Experiments are presented on the effectiveness
of the method.
8. Finite state based reductionist parsing for French by Jean-Pierre
Chanod and Pasi Tapanainen (ECAI; but see below the description of
the paper)
The paper describes a parser based on finite state methods. The
system includes nondeterministic tokenization, lexical analysis,
multiword recognition, shallow syntactic analysis. Examples of
treatment of French linguistic phenomena and an evaluation of the
parser effectiveness during the analysis of technical manuals are
presented. This paper is a considerably extended version of the
ECAI workshop paper by the same authors.
9. Light parsing as finite state filtering by Gregory Grefenstette
(ECAI)
The paper presents an approach to parsing useful in case of
applications that need to extract relevant information without
necessarily performing a full parse of the text. The approach is
based on the use of finite state markers and filters. An evaluation
of the parser effectiveness in analyzing a large corpus is
presented.
10. Vectorized finite state automata by Andras Kornai (ECAI)
It presents a technique of finite state parsing based on
vectorization and describes the application of such technique to
the problem of extracting relational information from texts. A
system based on such approach, NewsMonitor, is available in the
accompanying CD-ROM.
11. Finite state transducers: parsing free and frozen sentences by
Emmanuel Roche (ECAI)
In NLP finite state models are usually considered a lesser evil
with respect to more powerful techniques. The author instead
claims that they are quite suitable for representing accurately
complex linguistic phenomena. This claim is supported by examples
of finite state analysis of linguistic phenomena (i.e., support
verbs and frozen expressions).
12. Text and speech translation by means of subsequential transducers
by Juan Miguel Vilar, Victor Manuel Jimenez, Juan Carlos Amengual,
Antonio Castellanos, David Llorens and Enrique Vidal (ECAI; but
see below the description of the paper)
The authors propose a technique that increases the performance of
the learning algorithm of Subsequential Transducers from training
examples; moreover, the use of error-correcting parsing to
increase the robustness of the model is explored. Experiments on
both text and speech translation from Spanish to English are
described. This paper is a considerably extended version of the
ECAI workshop paper by J.M. Vilar, E. Vidal & J.C. Amengual.
13. Finite state segmentation of discourse into clauses by Eva Ejerhed
(ECAI)
The paper presents first of all the analysis of the correlation
between different acoustically and perceptually derived
information and clause boundaries in spoken utterances. Then it
proposes an algorithm for segmenting Swedish texts into clauses
and evaluates its performance, comparing the results on
automatically and manually tagged texts.
14. Between finite state and Prolog: constraint-based automata for
efficient recognition of phrases by Klaus U. Schulz and Tomek
Mikolajewski
The paper describes "constraint-based automata", that incorporate
features from finite state techniques and constraint programming.
Preliminary empirical evaluation of the performance of the
proposed approach against that of constraint logic programming
implementations is presented.
15. Explanation-based learning and finite state transducers:
applications to parsing lexicalized tree adjoining grammars by
Srinivas Banglore (ECAI)
The paper describes the application of explanation-based learning
(EBL) techniques to parsing Lexicalized Tree Adjoining Grammars.
Starting from a hand-crafted wide-coverage English grammar (XTAG
Group 1995), EBL techniques based on finite state transducers are
applied to customize the grammar to a specific domain. A
simplified parser, called stapler, is also described; the stapler
is used in conjunction with the results of the application of EBL
techniques. Experimental results of such approach are presented,
comparing the performance with respect to the original system
in terms of recall, number of parses and processing time.
16. Use of weighted finite state transducers in part of speech tagging
by Evelyne Tzoukermann and Dragomir R. Radev
The paper presents the application of weighted finite state
transducers to POS tagging. The approach uses a combination of
linguistic and statistical techniques for POS disambiguation.
Experimental results for French POS tagging are presented.
17. Colonies: a multi-agent approach to language generation by
Erzsebet Csuhaj-Varju (ECAI)
The paper presents "colonies", multi-agent symbol systems whose
behavior is jointly determined by the combination of very simple
grammars.
18. An innovative finite state concept for recognition and parsing of
context-free languages by Mark-Jan Nederhof and Eberhard Bertsch
(ECAI)
The paper shows that all the languages which are in the regular
closure of the class of the deterministic (context free) languages
can be recognized in linear time. The result is interesting as
this closure contains many inherently ambiguous languages.
19. Hidden Markov models with finite state supervision by Eric Sven
Ristad
The paper presents a supervised training approach to Hidden Markov
Models (HMMs). The author claims that, unlike popular ad hoc
techniques, the proposed approach is completely general, need not
make any simplifying assumptions about independence , and can take
better advantage of the information contained in the training
corpus.
In the accompanying CD-ROM there are 6 subdirectories with the
following contents:
- ECAI: the original papers of the ECAI workshop (also available on
the Web at the location: http://www.cs.rice.edu/~andras/ecai.html)
- Kanungo: a simple implementation of Hidden Markov Models realized
by Tapas Kanungo of the University of Maryland
- Kim: the morphological analyzer described in chapter 7
- Kornai: the NewsMonitor system, described in chapter 10
- Uniparse: the source code of the parser described in chapter 2 by
Joshi and Hopely
- Watson: FIRE Lite, the toolkit developed by Bruce W. Watson and
described in chapter 4
The ECAI papers not present in the book are listed below:
- Language modeling for speech recognition by Frederick Jelinek
- Regular expressions for finite-state syntactic description by Lauri
Karttunen
- Finite-state morphology and information retrieval by Kimmo
Koskenniemi
- Weighted automata in text and speech processing by Mehryar Mohri,
Fernando Pereira and Michael Riley
- Finite-state methods, binding, and anaphora by Richard Oehrle
- Efficient finite-state approximation of context free grammars by
Catherine Rood
- Multilingual finite-state noun phrase extraction by Anne Schiller
- Finite automata for processing word order by Wojciech Skut
- Multilingual text analysis for text-to-speech synthesis by Richard
Sproat
CRITICAL EVALUATION
The book contains papers that cover the application of finite state
techniques to a wide range of NLP areas (morphological analysis, POS
tagging, clause boundary detection, syntactic analysis, etc.). This
fact makes it difficult for a single person to have the necessary
expertise to thoroughly evaluate all the contributions (and, at the
same time, prospective readers will be probably interested only on a
subset of the papers, depending on their areas of interest).
Obviously also this review is partly influenced by the reviewer's
limited knowledge in some areas of NLP (particularly statistical
techniques).
The book contains some very interesting papers from both a theoretical
and an applicative perspective. However, it suffers from a defect
that is often present in books originating from workshops, i.e. the
fact that contributions are uneven in both quality (i.e., clarity of
the presentation, systematic coverage of all the main areas in the
field) and quantity (length and thoroughness of papers). This makes
also difficult to have an overall view of the different areas covered
by the various contributions.
Going to the analysis of some of the papers, I found the contribution
by Joshi & Hopely (chapter 2) particularly interesting as it provides
a useful historical perspective on the work in the field. Too often we
tend to concentrate only on the most recent contributions running the
risk to reinvent the wheel and this paper reminds us not to neglect
past experiences.
The papers by Chanod & Tapanainen and Grefenstette (chapters 8 and 9)
provide a useful indication of the current advances in the application
of finite-state techniques at Xerox Research Centre Europe in
Grenoble, one of the leading centers in this area.
The paper by Roche (chapter 11) is in my opinion not as convincing as
others by the same author, for instance that contained in (Roche and
Schabes 1997).
The papers by Tateno, Masuichi & Umemoto and Kim & Jang (chapters 6
and 7) fail to clearly explain the background and the issues specific
respectively to Japanese and Korean, needed to fully appreciate the
techniques proposed in the papers.
The paper by Srinivas (chapter 15) is a long and complex but
well-written contribution that proposes an approach that combines
manually developed generic grammars with domain-specific constraints
extracted from a corpus. The interesting experimental results seem
however due to to the particular formalism adopted (i.e., Lexicalized
Tree Adjoining Grammars) because they crucially employ some of its
specific characteristics.
In the paper by Kornai (chapter 10) it would have been interesting to
provide more details about the application of vectorized finite state
automata, i.e. the NewsMonitor system (also present in the
accompanying CD-ROM).
In some of the papers there is only a generic reference to the usage
in NLP of the techniques and tools described. For example, the paper
by Watson (chapter 4) mentions an interest in using the toolkit by
computational linguists. This same generic claim was present in the
original paper at the ECAI workshop. Provided that the ECAI workshop
took place in 1996, it would have been interesting that the author
made some explicit mentions of NLP areas where such uses have been
pursued in the meanwhile. The paper by Csuhaj-Varju (chapter 17)
presents some results from the field of formal languages. As far as I
understand, the only link with NLP is that some languages that can be
described using such results (e.g., the languages of multiple
agreement, crossed agreements and replication) would present
structures that are present in natural languages; no further evidence
for this claim is produced. Sometimes the link between formal results
and NLP is more explicit: the theoretical paper by Nederhof & Bertsch
(chapter 18) provides some more direct hints at the usefulness of the
results proposed for NLP.
Given my limited knowledge of statistical techniques, I cannot
thoroughly evaluate the paper by Ristad (chapter 19). I would only
underline that some experiments would probably be needed in order to
empirically verify the claims about the advantages of the proposed
method with respect to the standard ones.
Most contributions provide some kind of experimental evaluation of the
proposed techniques. However, it is not always clear if such
experimental results allow a real comparison with other techniques.
Among the contributions included in the NLE special issue and not
present in the book for various reasons, I have found particularly
interesting the paper "Partial parsing via finite-state cascades" by
Steven Abney. Other NLE papers not included in the book are: "Regular
expressions for language engineering" by Lauri Karttunen, Jean-Pierre
Chanod, Gregory Grefenstette and Anne Schiller, "Finite state
morphology and information retrieval" by Kimmo Koskeniemmi, and
"Multilingual text analysis for text-to-speech synthesis" by Richard
Sproat (the last two papers were present at the ECAI workshop in a
slightly different version).
The editorial care of the book is not completely satisfactory. There
are a few typos (not so many but they could be easily detected using a
spelling checker). Sometimes acronyms are used without being defined.
In a couple of papers there are pending references. The list of
bibliographical references presents some mistakes: for example, Roche
& Schabes 1997 is wrongly cited as "Finite-State Devices for Natural
Language Processing" (the correct title is "Finite-State Language
Processing"), there is one duplicated entry (Tapanainen 1995), there
is no coherence in the style of bibliographical entries. This is
obviously due to the fact that the bibliographical references are the
sum of the references of the single contributions; perhaps it would
have been better if every contribution had listed its references
separately.
The links mentioned on page 2 for HTK (Hidden Markov Model Toolkit)
and XFST (Xerox Finite State Technology) are no longer valid, probably
because of some reorganization undergone by the respective web sites.
The correct locations should be:
- http://www.entropic.com/support/FAQ/htk/index.html for HTK
- http://www.rxrc.xerox.com/research/mltt/fst/home.html for XFST
In conclusion, the book is a useful reading for people interested in
the use of finite state techniques in NLP and provides an interesting
perspective of the current status of the area. As said above, given
the wide range of NLP areas covered by the contributions, many
prospect readers will probably be interested only in part of the
papers in the book.
BIBLIOGRAPHY
Steven Abney 1996. Partial parsing via finite-state cascades. Natural
Language Engineering, 2(4): 337-344. (appeared also in the Proceedings
of the ESSLLI '96 Robust Parsing Workshop; also available at the
location http://www.sfs.nphil.uni-tuebingen.de/~abney/Papers.html)
Eva Ejerhed, Frederic Jelinek, Lauri Karttunen, Andras Kornai (eds.)
1996. Proceedings of the ECAI'96 workshop on Extended Finite State
Models of Language, Budapest, Hungary (also available at the
location http://www.cs.rice.edu/~andras/ecai.html).
Andras Kornai (ed.) 1996. Special issue on Extended Finite State
Models of Language. Natural Language Engineering, 2(4).
Emmanuel Roche and Yves Schabes (eds.) 1997. Finite-State Language
Processing. MIT Press, Cambridge, MA.
XTAG Group 1995. A Lexicalized Tree Adjoining Grammar for English.
Technical Report IRCS 95-03, University of Pennsylvania.
ABOUT THE REVIEWER
Alberto Lavelli is a researcher at ITC-IRST in Trento (Italy). His
interests are related to chart parsing of natural languages,
computational environments for grammar development and finite-state
parsing for Information Extraction. He is currently acting as Local
Arrangements Chair of IWPT 2000 (Sixth International Workshop on
Parsing Technologies) which will be held in Trento from 23 to 25
February 2000.
---------------------------------------------------------------------------
If you buy this book please tell the publisher or author
that you saw it reviewed on the LINGUIST list.
---------------------------------------------------------------------------
LINGUIST List: Vol-11-86
More information about the LINGUIST
mailing list