11.86, Review: Kornai: Extended Finite State Models of Lang.

LINGUIST Network linguist at linguistlist.org
Tue Jan 18 03:34:45 UTC 2000


LINGUIST List:  Vol-11-86. Mon Jan 17 2000. ISSN: 1068-4875.

Subject: 11.86, Review: Kornai: Extended Finite State Models of Lang.

Moderators: Anthony Rodrigues Aristar: Wayne State U.<aristar at linguistlist.org>
            Helen Dry: Eastern Michigan U. <hdry at linguistlist.org>
            Andrew Carnie: U. of Arizona <carnie at linguistlist.org>

Reviews: Andrew Carnie: U. of Arizona <carnie at linguistlist.org>

Associate Editors:  Martin Jacobsen <marty at linguistlist.org>
                    Ljuba Veselinova <ljuba at linguistlist.org>
                    Scott Fults <scott at linguistlist.org>
                    Jody Huellmantel <jody at linguistlist.org>
                    Karen Milligan <karen at linguistlist.org>

Assistant Editors:  Lydia Grebenyova <lydia at linguistlist.org>
                    Naomi Ogasawara <naomi at linguistlist.org>
                    James Yuells <james at linguistlist.org>

Software development: John H. Remmers <remmers at emunix.emich.edu>
                      Sudheendra Adiga <sudhi at linguistlist.org>
                      Qian Liao <qian at linguistlist.org>

Home Page:  http://linguistlist.org/


Editor for this issue: Andrew Carnie <carnie at linguistlist.org>
 ==========================================================================

What follows is another discussion note contributed to our Book Discussion
Forum.  We expect these discussions to be informal and interactive; and
the author of the book discussed is cordially invited to join in.

If you are interested in leading a book discussion, look for books
announced on LINGUIST as "available for discussion."  (This means that
the publisher has sent us a review copy.)  Then contact Andrew Carnie at
     carnie at linguistlist.org

=================================Directory=================================

1)
Date:  Sun, 16 Jan 2000 12:24:12 +0100 (MET)
From:  Alberto Lavelli <lavelli at itc.it>
Subject:  Review of "Extended Finite State Models of Language"

-------------------------------- Message 1 -------------------------------



Andras Kornai (ed.), 1999, Extended Finite State Models of Language,
Cambridge University Press, pages 278+xii (plus a CD-ROM).


Reviewed by Alberto Lavelli, ITC-IRST, Trento (Italy)


I will make frequent use of the following acronyms:

 - ECAI: European Conference on Artificial Intelligence
 - NLE: the journal Natural Language Engineering
 - NLP: Natural Language Processing
 - POS: part of speech


SYNOPSIS

The book appears in the ACL Studies in Natural Language Processing
series and originates from a workshop held in Budapest in 1996 during
ECAI'96. A set of articles partially overlapping with those in this
book appeared in a special issue of the journal Natural Language
Engineering (vol. 2 n. 4, December 1996); some of the NLE papers are
only abstracts of 2 or 3 pages. The electronic versions of the papers
presented at the ECAI'96 workshop are included on the CD-ROM
accompanying the book. I now briefly describe the papers contained in
the book. The papers that also appeared in the ECAI workshop
proceedings are marked with "ECAI" next to the authors' names. Note
that the ECAI'96 versions are sometimes considerably shorter than
those in the book.


1. Extended finite state models of language by Andras Kornai

   This is a general introduction with a brief presentation of the
   papers contained in the book and of the contents of the CD-ROM.

2. A parser from antiquity: an early application of finite state
   transducers to natural language parsing by Aravind K. Joshi and
   Philip Hopely (ECAI)

   This paper  describes a parser based  on a cascade of  finite state
   transducers developed at   the University of Pennsylvania in  1958.
   The parser is remarkably modern when compared to some of the recent
   work on finite state  transducers. A faithful reconstruction of the
   parser is available on the CD-ROM.

3. Comments on Joshi and Hopely by Lauri Karttunen (ECAI)

   It presents some brief remarks by Karttunen, who underlines the
   modernity of the parser described in the previous chapter by Joshi
   and Hopely.

4. Implementing and using finite automata toolkits by Bruce W. Watson
   (ECAI)

   It describes a toolkit (FIRE Lite) developed by the author while at
   the Eindhoven University of Technology and now freely available for
   non-commercial use. The toolkit is available on the accompanying
   CD-ROM and also on the Web at www.RibbitSoft.com (note, however,
   that at the beginning of January 2000 I repeatedly tried to connect
   to www.RibbitSoft.com without success). RibbitSoft also distributes
   a commercial version of the toolkit (FIRE Engine II). Both the
   commercial and the non-commercial toolkits implement algorithms for
   building automata from regular expressions and for minimizing
   deterministic finite automata.

5. Finite state morphology and formal verification by Manuel Vilares
   Ferro, Jorge Grana Gil and Pilar Alvarino Alvarino

   It presents the use  of  verification methods to ease   maintenance
   during  the development of  resources for  morphological analyzers.
   Examples and experiments on Spanish are presented.

6. The Japanese lexical transducer based on stem-suffix style forms by
   Masakazu Tateno, Hiroshi Masuichi and Hiroshi Umemoto (ECAI)

   It    describes a method  for    building a lexical transducer  for
   Japanese with stems and  suffixes  stored separately  in  different
   lexicons; an  extra level  of  automata relates  canonical citation
   forms and stem-suffix style forms.

7. Acquiring rules for reducing morphological ambiguity from POS
   tagged corpus in Korean by Jae-Hoon Kim and Byung-Gyu Jang

   It  presents a method  for  reducing morphological ambiguities when
   performing morphological analysis of    Korean texts.  The   method
   automatically infers rules (called  subsumption conditions) from  a
   POS tagged corpus.  Experiments  are presented on the effectiveness
   of the method.

8. Finite state based reductionist parsing for French by Jean-Pierre
   Chanod and Pasi Tapanainen (ECAI; but see below the description of
   the paper)

   The paper describes a parser based on finite state methods. The
   system includes nondeterministic tokenization, lexical analysis,
   multiword recognition and shallow syntactic analysis. Examples of
   the treatment of French linguistic phenomena and an evaluation of
   the parser's effectiveness in analyzing technical manuals are
   presented. This paper is a considerably extended version of the
   ECAI workshop paper by the same authors.

9. Light parsing as finite state filtering by Gregory Grefenstette
   (ECAI)

   The paper presents an approach to parsing that is useful for
   applications that need to extract relevant information without
   necessarily performing a full parse of the text. The approach is
   based on the use of finite state markers and filters. An evaluation
   of the parser's effectiveness in analyzing a large corpus is
   presented.

10. Vectorized finite state automata by Andras Kornai (ECAI)

    It presents a technique of finite state parsing based on
    vectorization and describes the application of this technique to
    the problem of extracting relational information from texts. A
    system based on this approach, NewsMonitor, is available on the
    accompanying CD-ROM.

11. Finite state transducers: parsing free and frozen sentences by
    Emmanuel Roche (ECAI)

    In NLP  finite state models are usually  considered  a lesser evil
    with  respect  to more   powerful techniques.  The  author instead
    claims that  they  are quite suitable for  representing accurately
    complex linguistic phenomena.  This claim is supported by examples
    of finite  state analysis  of linguistic phenomena  (i.e., support
    verbs and frozen expressions).

12. Text and speech translation by means of subsequential transducers
    by Juan Miguel Vilar, Victor Manuel Jimenez, Juan Carlos Amengual,
    Antonio Castellanos, David Llorens and Enrique Vidal (ECAI; but
    see below the description of the paper)

    The authors propose a technique that improves the performance of
    the algorithm for learning subsequential transducers from training
    examples; moreover, the use of error-correcting parsing to
    increase the robustness of the model is explored. Experiments on
    both text and speech translation from Spanish to English are
    described. This paper is a considerably extended version of the
    ECAI workshop paper by J.M. Vilar, E. Vidal & J.C. Amengual.

13. Finite state segmentation of discourse into clauses by Eva Ejerhed
    (ECAI)

    The paper first analyzes the correlation between various kinds of
    acoustically and perceptually derived information and clause
    boundaries in spoken utterances. It then proposes an algorithm for
    segmenting Swedish texts into clauses and evaluates its
    performance, comparing the results on automatically and manually
    tagged texts.

14. Between finite state and Prolog: constraint-based automata for
    efficient recognition of phrases by Klaus U. Schulz and Tomek
    Mikolajewski

    The paper describes "constraint-based automata", which incorporate
    features from finite state techniques and constraint programming.
    A preliminary empirical evaluation of the performance of the
    proposed approach against that of constraint logic programming
    implementations is presented.

15. Explanation-based learning and finite state transducers:
    applications to parsing lexicalized tree adjoining grammars by
    Srinivas Bangalore (ECAI)

    The paper describes the application of explanation-based learning
    (EBL) techniques to parsing Lexicalized Tree Adjoining Grammars.
    Starting from a hand-crafted wide-coverage English grammar (XTAG
    Group 1995), EBL techniques based on finite state transducers are
    applied to customize the grammar to a specific domain. A
    simplified parser, called stapler, is also described; the stapler
    is used in conjunction with the results of applying the EBL
    techniques. Experimental results for this approach are presented,
    comparing its performance with that of the original system in
    terms of recall, number of parses and processing time.

16. Use of weighted finite state transducers in part of speech tagging
    by Evelyne Tzoukermann and Dragomir R. Radev

    The  paper  presents  the  application  of weighted  finite  state
    transducers to POS tagging.  The   approach uses a combination  of
    linguistic  and statistical   techniques  for  POS disambiguation.
    Experimental results for French POS tagging are presented.

17. Colonies: a multi-agent approach to language generation by
    Erzsebet Csuhaj-Varju (ECAI)

    The paper   presents "colonies", multi-agent  symbol systems whose
    behavior  is jointly determined by  the combination of very simple
    grammars.

18. An innovative finite state concept for recognition and parsing of
    context-free languages by Mark-Jan Nederhof and Eberhard Bertsch
    (ECAI)

    The paper  shows that all the  languages which  are in the regular
    closure of the class of the deterministic (context free) languages
    can be recognized  in linear time.  The result  is interesting  as
    this closure contains many inherently ambiguous languages.

19. Hidden Markov models with finite state supervision by Eric Sven
    Ristad

    The paper presents a supervised training approach to Hidden Markov
    Models (HMMs). The author claims that, unlike popular ad hoc
    techniques, the proposed approach is completely general, need not
    make any simplifying assumptions about independence, and can take
    better advantage of the information contained in the training
    corpus.
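
As a side illustration of one of the operations mentioned under
chapter 4 (minimizing deterministic finite automata), here is a
minimal Python sketch of Moore-style partition refinement. It is not
taken from FIRE Lite and says nothing about that toolkit's interface;
the function names, the dictionary-based transition table and the toy
automaton are assumptions made for this example only.

```python
def minimize_dfa(alphabet, delta, start, accepting):
    """Minimize a complete DFA by Moore-style partition refinement.

    delta maps (state, symbol) -> state and must be defined for every
    reachable state and every symbol. Returns (states, delta, start,
    accepting) of the minimized automaton, whose states are integer
    block identifiers.
    """
    symbols = sorted(alphabet)

    # 1. Discard states that cannot be reached from the start state.
    reachable, stack = {start}, [start]
    while stack:
        s = stack.pop()
        for a in symbols:
            t = delta[(s, a)]
            if t not in reachable:
                reachable.add(t)
                stack.append(t)

    # 2. Start from the accepting / non-accepting split and refine it
    #    until no block can be split any further.
    block = {s: int(s in accepting) for s in reachable}
    while True:
        ids = {}
        new_block = {}
        for s in reachable:
            # Two states stay together only if they share a block and
            # their successors share a block for every symbol.
            signature = (block[s],) + tuple(block[delta[(s, a)]] for a in symbols)
            new_block[s] = ids.setdefault(signature, len(ids))
        stable = len(ids) == len(set(block.values()))
        block = new_block
        if stable:
            break

    # 3. Build the quotient automaton over the final blocks.
    min_delta = {(block[s], a): block[delta[(s, a)]]
                 for s in reachable for a in symbols}
    min_accepting = {block[s] for s in reachable if s in accepting}
    return set(block.values()), min_delta, block[start], min_accepting


def accepts(delta, start, accepting, word):
    """Run a DFA on a word and report whether it ends in an accepting state."""
    state = start
    for symbol in word:
        state = delta[(state, symbol)]
    return state in accepting


# Toy DFA (hypothetical): strings over {a, b} that end in "a".
# States 0 and 2 behave identically, so they should be merged.
toy_delta = {(0, "a"): 1, (0, "b"): 2,
             (1, "a"): 1, (1, "b"): 2,
             (2, "a"): 1, (2, "b"): 2}
m_states, m_delta, m_start, m_accepting = minimize_dfa({"a", "b"}, toy_delta, 0, {1})
```

On the toy automaton the two equivalent states collapse into one, so
the minimized machine has two states and still accepts exactly the
strings ending in "a".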


The accompanying CD-ROM contains six subdirectories with the
following contents:

 - ECAI: the original papers of the ECAI workshop (also available on
   the Web at the location: http://www.cs.rice.edu/~andras/ecai.html)
 - Kanungo: a simple implementation of Hidden Markov Models realized
   by Tapas Kanungo of the University of Maryland
 - Kim: the morphological analyzer described in chapter 7
 - Kornai: the NewsMonitor system, described in chapter 10
 - Uniparse: the source code of the parser described in chapter 2 by
   Joshi and Hopely
 - Watson: FIRE Lite, the toolkit developed by Bruce W. Watson and
   described in chapter 4

The ECAI papers not present in the book are listed below:

 - Language modeling for speech recognition by Frederick Jelinek
 - Regular expressions for finite-state syntactic description by Lauri
   Karttunen
 - Finite-state morphology and information retrieval by Kimmo
   Koskenniemi
 - Weighted automata in text and speech processing by Mehryar Mohri,
   Fernando Pereira and Michael Riley
 - Finite-state methods, binding, and anaphora by Richard Oehrle
 - Efficient finite-state approximation of context free grammars by
   Catherine Rood
 - Multilingual finite-state noun phrase extraction by Anne Schiller
 - Finite automata for processing word order by Wojciech Skut
 - Multilingual text analysis for text-to-speech synthesis by Richard
   Sproat


CRITICAL EVALUATION

The book contains papers that cover the application of finite state
techniques to a wide range of NLP areas (morphological analysis, POS
tagging, clause boundary detection, syntactic analysis, etc.). This
breadth makes it difficult for a single person to have the expertise
needed to evaluate all the contributions thoroughly (and, at the same
time, prospective readers will probably be interested in only a
subset of the papers, depending on their areas of interest). This
review, too, is inevitably affected by the reviewer's limited
knowledge of some areas of NLP (particularly statistical techniques).

The book contains some very interesting papers from both a theoretical
and an applied perspective. However, it suffers from a defect that is
often present in books originating from workshops, namely that the
contributions are uneven in both quality (i.e., clarity of the
presentation, systematic coverage of all the main areas in the field)
and quantity (length and thoroughness of the papers). This also makes
it difficult to get an overall view of the different areas covered by
the various contributions.


Turning to some of the individual papers, I found the contribution
by Joshi & Hopely (chapter 2) particularly interesting, as it provides
a useful historical perspective on the work in the field. Too often we
tend to concentrate only on the most recent contributions, running the
risk of reinventing the wheel, and this paper reminds us not to
neglect past experience.

The papers by Chanod & Tapanainen and  Grefenstette (chapters 8 and 9)
provide a useful indication of the current advances in the application
of  finite-state   techniques  at  Xerox  Research    Centre Europe in
Grenoble, one of the leading centers in this area.

The paper by Roche (chapter 11) is in my  opinion not as convincing as
others by the same  author, for instance  that contained in (Roche and
Schabes 1997).

The papers by Tateno, Masuichi & Umemoto and by Kim & Jang (chapters 6
and 7) fail to explain clearly the background and the issues specific
to Japanese and Korean respectively, which are needed to fully
appreciate the techniques proposed in the papers.

The paper by Srinivas (chapter 15) is a long and complex but
well-written contribution that proposes an approach combining
manually developed generic grammars with domain-specific constraints
extracted from a corpus. The interesting experimental results seem,
however, to be due to the particular formalism adopted (i.e.,
Lexicalized Tree Adjoining Grammars), because they crucially exploit
some of its specific characteristics.

In the paper by Kornai (chapter 10) it would have been interesting to
have more details about the application of vectorized finite state
automata, i.e. the NewsMonitor system (also included on the
accompanying CD-ROM).

Some of the papers make only a generic reference to the use in NLP
of the techniques and tools described. For example, the paper by
Watson (chapter 4) mentions an interest among computational linguists
in using the toolkit. This same generic claim was present in the
original paper at the ECAI workshop. Given that the ECAI workshop
took place in 1996, it would have been interesting if the author had
explicitly mentioned NLP areas where such uses have been pursued in
the meantime. The paper by Csuhaj-Varju (chapter 17) presents some
results from the field of formal languages. As far as I understand,
the only link with NLP is that some languages that can be described
using such results (e.g., the languages of multiple agreement,
crossed agreements and replication) exhibit structures that are also
present in natural languages; no further evidence for this claim is
produced. Sometimes the link between formal results and NLP is more
explicit: the theoretical paper by Nederhof & Bertsch (chapter 18)
provides some more direct hints at the usefulness of the proposed
results for NLP.

Given my limited knowledge of statistical techniques, I cannot
thoroughly evaluate the paper by Ristad (chapter 19). I would only
point out that some experiments would probably be needed to verify
empirically the claimed advantages of the proposed method over the
standard ones.

Most contributions provide some kind of experimental evaluation of the
proposed techniques. However, it is not always clear whether these
experimental results allow a real comparison with other techniques.


Among the contributions included in the NLE special issue but not
present in the book for various reasons, I found the paper "Partial
parsing via finite-state cascades" by Steven Abney particularly
interesting. Other NLE papers not included in the book are: "Regular
expressions for language engineering" by Lauri Karttunen, Jean-Pierre
Chanod, Gregory Grefenstette and Anne Schiller, "Finite state
morphology and information retrieval" by Kimmo Koskenniemi, and
"Multilingual text analysis for text-to-speech synthesis" by Richard
Sproat (the last two papers were presented at the ECAI workshop in a
slightly different version).

The editorial care of the book is not completely satisfactory. There
are a few typos (not many, but they could easily have been detected
with a spelling checker). Sometimes acronyms are used without being
defined. In a couple of papers there are unresolved references. The
list of bibliographical references contains some mistakes: for
example, Roche & Schabes 1997 is wrongly cited as "Finite-State
Devices for Natural Language Processing" (the correct title is
"Finite-State Language Processing"), there is one duplicated entry
(Tapanainen 1995), and the style of the bibliographical entries is
inconsistent. This is obviously due to the fact that the list is the
union of the references of the individual contributions; perhaps it
would have been better if every contribution had listed its
references separately.

The links mentioned on page 2 for HTK (Hidden Markov Model Toolkit)
and XFST (Xerox Finite State Technology) are no longer valid, probably
because the respective web sites have been reorganized. The correct
locations should be:

 - http://www.entropic.com/support/FAQ/htk/index.html for HTK
 - http://www.rxrc.xerox.com/research/mltt/fst/home.html for XFST


In conclusion, the book is useful reading for people interested in
the use of finite state techniques in NLP and provides an interesting
perspective on the current status of the area. As said above, given
the wide range of NLP areas covered by the contributions, many
prospective readers will probably be interested in only some of the
papers in the book.


BIBLIOGRAPHY

Steven Abney 1996. Partial parsing via finite-state cascades. Natural
Language Engineering, 2(4): 337-344. (appeared also in the Proceedings
of the ESSLLI '96 Robust Parsing Workshop; also available at the
location http://www.sfs.nphil.uni-tuebingen.de/~abney/Papers.html)

Eva Ejerhed, Frederick Jelinek, Lauri Karttunen, Andras Kornai (eds.)
1996. Proceedings of the ECAI'96 workshop on Extended Finite State
Models of Language, Budapest, Hungary (also available at the
location http://www.cs.rice.edu/~andras/ecai.html).

Andras Kornai (ed.) 1996. Special issue on Extended Finite State
Models of Language. Natural Language Engineering, 2(4).

Emmanuel Roche and Yves Schabes (eds.) 1997.  Finite-State Language
Processing. MIT Press, Cambridge, MA.

XTAG Group 1995. A Lexicalized Tree Adjoining Grammar for English.
Technical Report IRCS 95-03, University of Pennsylvania.


ABOUT THE REVIEWER

Alberto Lavelli is a researcher at ITC-IRST in Trento (Italy). His
interests include chart parsing of natural languages, computational
environments for grammar development and finite-state parsing for
Information Extraction. He is currently acting as Local Arrangements
Chair of IWPT 2000 (Sixth International Workshop on Parsing
Technologies), which will be held in Trento from 23 to 25 February
2000.


---------------------------------------------------------------------------

If you buy this book please tell the publisher or author
that you saw it reviewed on the LINGUIST list.

---------------------------------------------------------------------------