

LINGUIST List: Vol-33-496. Wed Feb 09 2022. ISSN: 1069 - 4875.

Subject: 33.496, Review: Computational Linguistics: Gorman, Sproat (2021)

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Billy Dickson
Managing Editor: Lauren Perkins
Team: Helen Aristar-Dry, Everett Green, Sarah Goldfinch, Nils Hjortnaes,
      Joshua Sims, Billy Dickson, Amalia Robinson, Matthew Fort
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Billy Dickson <billyd at linguistlist.org>
================================================================


Date: Wed, 09 Feb 2022 10:43:58
From: Michael Maxwell [mmaxwell at umd.edu]
Subject: Finite-State Text Processing

 
Discuss this message:
http://linguistlist.org/pubs/reviews/get-review.cfm?subid=36761457


Book announced at http://linguistlist.org/issues/32/32-2067.html

AUTHOR: Kyle  Gorman
AUTHOR: Richard  Sproat
TITLE: Finite-State Text Processing
SERIES TITLE: Synthesis Lectures on Human Language Technologies
PUBLISHER: Morgan & Claypool Publishers
YEAR: 2021

REVIEWER: Michael B. Maxwell, University of Maryland

INTRODUCTION

WHY FSTs?

Since at least the publication in 1968 of Chomsky and Halle's The Sound
Pattern of English (and ultimately going back to the work of Trubetzkoy and
others in
the 1930s), most linguists have considered the sounds of natural languages to
be composed of phonological feature-value pairs, like [+voice] or [-nasal]. 
And during the last 30 years or more, the discussion in phonology has revolved
around the structure of those features (e.g. debates over feature geometry),
as well as what the universal set of features is (or more recently, whether
there is such a set), and the role of rules vs. constraints.  Probably most
phonologists today would come down on the side of a relatively articulated
(pun intended) feature geometry, and in favor of constraints instead of rules.

Abandon features and constraints, all ye who enter computational phonology and
morphology!  The tool of choice for modeling morphology and phonology in
computational linguistics is the Finite State Transducer (FST).  I know of
only one morphological parser that allows stating rules in terms of
phonological features (the Hermit Crab parser, one of the two parsers
available in SIL's FieldWorks Language Explorer, or FLEx), and it is not an
FST.  While one could in principle build an FST that expressed rules in terms
of features, so far as I know, no one has done so.  Indeed, FSTs do not even
represent phonemes directly; instead, they represent everything in terms of
characters in a writing system.  This includes natural classes, which are
defined extensionally (by listing the character strings included) rather than
intensionally (by features common to all the sounds in the natural class).

Moreover, FSTs have difficulty in modeling constraint-based theories, such as
Optimality Theory (but see Karttunen 1998).  Instead, most modern FSTs use a
rule-based approach that an American structuralist from the 1950s would be
comfortable with.

Given that FSTs force one to ignore 50 years of theoretical phonology, why
should a linguist be interested?  The answer is simple: FSTs allow you to
model linguistic data in an observationally adequate fashion; that is, they
allow you to recognize and generate the words of a language.  (Some readers may be aware
that full word reduplication is not finite state; I will return to this
later.)  One can use FSTs to build morphological parsers, spell correctors for
highly inflected languages, dictionary lookup programs that make allowance for
misspellings, and many other useful tools.  In other words, while FSTs do not
allow you to model the latest and greatest theory of phonology or morphology,
they do make it practical to do (some kinds of) language engineering.  And
that is a Good Thing, in my opinion.

Now, on to the review.

SUMMARY

The book is surprisingly short: 119 pages including appendices, plus an
extensive bibliography and a tiny index.  The first chapter is a largely
mathematical introduction to finite state machines (automata and transducers),
including weighted finite state automata and transducers.  Using weights
allows you to model the fact that some forms may be more likely than others. 
While this can be used to encode the likelihood of seeing particular words, a
more likely use in language engineering is the parsing and lookup of
misspelled forms.  For languages with standardized spelling, this is
important; for languages without standardized spelling or languages where the
spelling is in flux (the vast majority of languages), it is crucial.

The second chapter presents the low-level Pynini implementation.  Pynini
provides a Python wrapper (hence the 'Py' of 'Pynini') for the OpenFst library
(written in C++), together with additional functionality also written in
Python.  (C++ and Python are computer programming languages.)  Pynini and
OpenFst can be freely downloaded and used under the Apache license.  (I
verified the license from the Pynini websites, https://pypi.org/project/pynini/
and https://www.opengrm.org/twiki/bin/view/GRM/Pynini; as far as I can tell,
the book does not document this.)  The second chapter also includes a discussion
of character encoding (although if you have to deal with Unicode issues like
normalization, you'll need to supplement the discussion here with other
sources).
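
To give the flavor of this, here is a minimal sketch (mine, not an example
from the book) of Pynini used as an ordinary Python library; it assumes a
recent Pynini version with the accep() and cross() functions:

    import pynini

    # Build a transducer mapping one spelling to another, then apply it
    # to an input string by composition (the @ operator).
    rule = pynini.cross("color", "colour")
    result = pynini.accep("color") @ rule
    # Keep only the output side and print the resulting string.
    print(result.project("output").string())  # -> "colour"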

A couple of library functions are described here (most are in later
chapters).  In this and the next couple of chapters, little or no mention is
made of
linguistic applications.  For example, the functions which create string maps,
''the union of many input/output string pairs'', are useful for reading in
dictionaries, but that is left largely implicit in the text.
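
To make this concrete, dictionary loading might be sketched as follows (my
own code; the file name and entries are hypothetical):

    import pynini

    # "lexicon.tsv" is a hypothetical file of tab-separated input/output
    # pairs, e.g. a wordform column and a gloss or analysis column.
    lexicon = pynini.string_file("lexicon.tsv")

    # The same union can also be built in memory from pairs:
    lexicon = pynini.string_map([("winik", "man"), ("mut", "bird")])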

There is a short comparison at the end of this chapter with other FST tools. 
The two most significant differences are that Pynini supports weighted FSTs
(as do some other FST toolkits), and that it functions as a Python library
(more on this in the Evaluation section, below).

Chapter Three discusses ''basic algorithms'' such as intersection or union of
two FSTs.  The discussion is still at a mathematical level; no mention is made
in this chapter of the fact that some of these algorithms correspond to
common linguistic processes (concatenation is what happens in
item-and-arrangement models of affixation, and composition can be used to
model the application of a phonological rule), whereas other algorithms
(difference and reversal) have no obvious correspondences to linguistic
processes.
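
The correspondence is easy to demonstrate, though the book leaves it largely
implicit; in the following sketch (mine, not the book's), suffixation is
concatenation and rule application is composition:

    import pynini
    from pynini.lib import byte

    sigma_star = byte.BYTE.closure()  # 'any string' over the byte alphabet

    # Item-and-arrangement affixation as concatenation:
    stems = pynini.union("walk", "cry")
    past = stems + pynini.accep("ed")       # accepts "walked" and "cryed"

    # An orthographic rule applied by composition: y -> i before "ed".
    y_to_i = pynini.cdrewrite(pynini.cross("y", "i"), "", "ed", sigma_star)
    surface = (past @ y_to_i).optimize()
    print(list(surface.paths().ostrings()))  # e.g. ['walked', 'cried']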

The fourth chapter is aptly titled ''Advanced Algorithms'': algorithms having
to do with optimization and path length are briefly presented, with an
emphasis on the mathematical rather than linguistic side; most linguists will
probably skip over this.  Mention is made of the fact that the distance and
path length algorithms are useful in speech recognition, but they can also be
used to correct misspellings.
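
For instance (my sketch, not the book's), given a weighted union of candidate
corrections, pynini.shortestpath() extracts the lowest-cost one:

    import pynini

    # Two candidates; in the default tropical semiring, lower weights
    # are better, so "their" is returned.
    candidates = pynini.union(
        pynini.accep("their", weight=0.5),
        pynini.accep("there", weight=1.5))
    print(pynini.shortestpath(candidates).string())  # -> "their"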

With the fifth chapter, there is a turn to a linguistic orientation: rewrite
rules, which more or less correspond to linguists' phonological rules, are
introduced.  These rules are illustrated with examples of a Spanish
grapheme-to-phoneme converter, Finnish vowel harmony, and the conversion
between written expressions of currency (like ''$50'') and their spoken
equivalents (''fifty dollars'').
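
A sketch of such a rule (mine; the book's Spanish example is more complete)
using the cdrewrite() rule constructor:

    import pynini
    from pynini.lib import byte

    sigma_star = byte.BYTE.closure()
    # Rewrite "qu" as "k" in all contexts, as a (toy) Spanish
    # grapheme-to-phoneme converter might.
    g2p = pynini.cdrewrite(pynini.cross("qu", "k"), "", "", sigma_star)
    print((pynini.accep("queso") @ g2p).project("output").string())  # "keso"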

Chapter Six provides several worked examples of morphology: Russian noun
suffixes (with inflection classes), Tagalog verb infixes, and a partial
treatment of Yowlumne verb aspects.

Latin is briefly mentioned, with the statement that ''the Pynini distribution
includes a fairly extensive treatment of over 2500 Latin verbs.''  This is in
fact missing from the current distribution (although this version does contain
the code for Latin noun declensions); I am told it will be added to the next
version of Pynini (Kyle Gorman, p.c.).

Chapter Seven discusses several uses of weighted finite state transducers
that are highly relevant to practical applications.  Chapter Eight tries to
predict the future, asking, ''In light of the current trends toward machine
learning and neural net architectures, will FSTs still be used?''  Gorman and
Sproat are (predictably) optimistic; I suspect (and hope!) they are right.
The appendices provide further technical details on installation, regular
expression parsing, etc.

SOFTWARE IMPLEMENTATION

Since this is not really a book that one would read for fun, or even for its
theoretical implications--it is rather a description of an open source
software package--I downloaded and installed the Pynini software onto my
computer so that I could test it.  (I am running Windows 10, with the Windows
Subsystem for Linux.)  Appendix A describes two installation methods:
installation of a pre-compiled distribution, and installation from source
followed by compilation.  I tried both.

For the pre-compiled distribution, one must first install the Anaconda
package, which proved not to be straightforward (although that is not the
fault of the Pynini implementers); there are instructions at
https://gist.github.com/kauffmanes/.  One then runs `conda activate python3'. 
(There may be other ways to accomplish this step.)  Once I had done this, it
was possible to `import pynini' in Python (but see below about the
PYTHONPATH).  If one uses this method, it will probably be helpful to also
download the Pynini source code, which includes implemented examples of
paradigms etc. that serve as learning aids (choose the most recent version
from https://www.opengrm.org/twiki/bin/view/GRM/PyniniDownload).

I also installed Pynini from source code.  This was time consuming (and at
least in my case required setting up a swap file to compile OpenFST), but it
does provide the Pynini library files, plus complete versions of the examples
in Chapter 6 (contained in a directory named 'tests') and Chapter 7 (in the
subdirectory 'pynini/examples').  I also found the source code for paradigms,
features etc. (in the directory 'pynini/lib') to be enlightening.

With both methods, I had to add /usr/local/anaconda3/lib/ to the environment
variable PYTHONPATH; with the source code method, I also needed to add
/usr/local/lib/ to the LD_LIBRARY_PATH.

I next tried to import a dictionary into Pynini, using the string_file()
function (described in section 2.3.2 of the book).  This led to my first error
message, ''Unterminated escape.''  (This error message was accompanied by a
deprecation warning concerning a third party library used by Pynini, `apport';
it appears that this is a new issue, since the bug report about this warning
on the Apport package's website is dated 15 October 2021. The deprecation
warning also appears when some of the other error messages are output;
presumably this will go away whenever the Apport package is updated.)

The ''unterminated escape'' error turned out to be because I had guessed wrong
about which characters need to be escaped (preceded with a backslash in Python
strings to be read as regular expressions by Pynini).  Specifically, I had
wrongly guessed that I would need to escape a single quote character (which
represents a glottal stop in Tzeltal, the language I was working with), as I
would in several other FST regular expression languages; but in Pynini,
apparently only square brackets need to be escaped.  In fact, the functions
for which special characters like '.' and '*' are used in most regular
expression languages are instead performed in Pynini by Python functions and
operators.  Once I had corrected my dictionary import file so that quote
marks were not escaped, this error disappeared.
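
For the record, the contrast looks like this (my reconstruction, reusing the
stem 'ko'tan' that comes up later in this review):

    import pynini

    # An apostrophe (a glottal stop in the orthography) needs no escape:
    stem = pynini.accep("ko'tan")
    # Square brackets delimit generated symbols in Pynini strings, so
    # literal brackets must be escaped (a raw string keeps the backslashes):
    bracketed = pynini.accep(r"\[sic\]")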

Another error message which I encountered frequently was ''Composition
failure''.  This can have several sources, but one is wrongly defining a
variable that encodes the equivalent of '.*' in many regular expression
languages.  The difficulty here is that what constitutes the wild card
character '.' needs to be specified in Pynini, and it can be treacherous to
define--I found myself frequently needing to add some character that I wasn't
expecting to exist in my grammar, probably a character in names of
morphosyntactic features or a punctuation character.  Fortunately, Pynini
provides an alternative way to define '.*': for each category (noun,
intransitive verb etc.) that you define, its ''sigma star'' is automatically
created.  Thus, if you define a category 'noun', noun.sigma_star is
automatically defined and functions as '.*'.  This can then be used as needed
in rewrite (phonological) rules, etc.  (Alternatively, sigma_star could
presumably be defined for all categories at once by simply using byte.BYTE, a
constant defined in one of the source code files.)
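
A sketch of that last alternative (my assumption about the intended usage):

    import pynini
    from pynini.lib import byte

    # byte.BYTE accepts any single byte, so its closure behaves as a
    # global sigma_star, i.e. the equivalent of '.*'.
    sigma_star = byte.BYTE.closure().optimize()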

Once past these issues, I was able to use Pynini to build and test
inflectional morphology analyzers and generators for Tzeltal, including affix
allomorphs based on phonological environments (but see the Evaluation section
below for problems using the paradigm library for agglutinating morphology).

Finally, I tested weighted transducers.  I wanted to capture an instance of
dialectal variation in Tzeltal, by which /x/ (written 'j') and /h/ undergo
phoneme merger in some varieties.  The standard orthography follows the
dialects that retain the distinction.  Since the affected phoneme can appear
in stems (lexemes), simply listing both variant forms everywhere it appears
would be cumbersome.  Instead, I wrote a replacement rule (a sort of
phonological rule)
that optionally converts 'j' to 'h', with a weight so the standard orthography
forms were preferred.  At first I hand-crafted the rule to be optional, but
later discovered that the rule constructor takes an optional argument
mode=''opt'', which results in the rule's application being optional (this
does not appear to be documented in the book, but is mentioned in the Python
'help' for cdrewrite()).  The result was a transducer that can recognize forms
written either way, but assigns a penalty to ''incorrectly'' spelled forms
(which would be important for dictionary lookup in case there is another word
with the same spelling as the ''incorrectly'' written one).
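
My rule looked roughly like the following sketch (simplified, and with an
arbitrary weight value):

    import pynini
    from pynini.lib import byte

    sigma_star = byte.BYTE.closure()
    # Optionally rewrite 'j' as 'h' anywhere; the weight on the output
    # penalizes the nonstandard spelling, so standard forms are preferred.
    j_to_h = pynini.cdrewrite(
        pynini.cross("j", pynini.accep("h", weight=1.0)),
        "", "", sigma_star, mode="opt")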

Such a fuzzy lookup mechanism using optional weighted rules is useful in a
variety of situations, for example dictionary lookup where orthographies are
fluid; or (like Tzeltal) where there is predictable phonological variation
among dialects; or where second language learners are trying to look up words
in a dictionary, but are uncertain about spelling or have difficulty
distinguishing particular sounds.  While simple Levenshtein distance metrics
(whose implementation is described in Chapter Seven) can help in such
situations, the fact that some spelling mistakes are much more likely than
others makes weighted rules a better choice.  For example, if a phonemic
contrast (or an entire phoneme) is missing in one language, it is often harder
for native speakers of that language to hear the contrast in a second
language.  Thus, for most English speakers the contrast between dental and
retroflex sounds in many Indic languages is a source of confusion.  A
dictionary lookup tool that compensates for this particular difficulty using a
weighted optional rule is more likely to return useful results for such users
than a tool that simply considers unweighted Levenshtein edit distance, which
would make a retroflex 't', 'p' or 'f' equally likely misspellings for a
dental 't'.
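
Putting the pieces together, such a lookup might be sketched as follows (my
own code; 'j_to_h' is the rule sketched above, and 'lexicon' is a
hypothetical acceptor over standard spellings):

    import pynini

    def lookup(query, rule, lexicon):
        # Map the (possibly nonstandard) query back through the inverted
        # variation rule, restrict the result to forms in the lexicon,
        # and return the cheapest surviving analysis.
        lattice = pynini.accep(query) @ rule.copy().invert() @ lexicon
        return pynini.shortestpath(lattice).project("output").string()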

EVALUATION

The book is largely written with a computational audience in mind, not a
linguistic one.  Unlike, for example, Beesley and Karttunen (2003), there is
relatively little exposition of how one would use Pynini to account for a wide
variety of phonological and morphological phenomena, or in some cases even a
mention of their existence.  This is clear, for example, in the lack of any
mention of whole word reduplication, which is not finite state, and so cannot
easily be represented in most finite state toolkits.  Reduplication in fact
represents an advantage for Pynini: while the Xerox FST described in Beesley
and Karttunen's book provided a work-around for reduplication, there are
situations where this work-around cannot be used.  But the fact that Pynini is
embedded in Python allows a very simple solution: Python can copy strings of
unlimited length, making reduplication trivial (see the sketch below).
Agglutinating morphology is another notable absence (more on this below).  In
general, I believe the book could have been made more accessible to linguists
by including more examples modeling additional kinds of phonological and
morphological processes.
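
A sketch of that solution (mine, not the book's; the stems and the tag are
illustrative only):

    import pynini

    # Full reduplication is not finite state in general, but for a finite
    # stem list Python can simply enumerate the copies.
    stems = ["orang", "anak"]
    redup = pynini.string_map(
        (s + "-" + s, s + "+RED") for s in stems).optimize()
    # "orang-orang" now parses as "orang+RED".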

Somewhat balancing this lack are the very useful discussions in Chapter Seven
of things linguists often ignore: fuzzy string matching (such as misspellings
and dictionary lookup), converting between the representation of numbers as
digit + punctuation sequences (''3.14'') and their spoken form (''three point
one four''), chatspeak normalization, and other techniques needed for real
world applications.

The discussion of paradigms in Chapter Six equates paradigm slots with
paradigm cells, a rather unusual (and at least to me, confusing) use of the
term ''slot.''  In morphology, a slot usually refers to (roughly) a position
relative to a word's stem which contains alternative inflectional affixes,
i.e. different slots are in a syntagmatic relationship; whereas a paradigm
cell refers to a position in an abstract paradigm which contains a single
inflected wordform, with the cells of the paradigm in a paradigmatic
relationship with each other.

I also encountered some difficulties making the paradigm software work.  As
mentioned several times above, in its present state the Pynini paradigms
library does not work with agglutinating morphology, i.e. with categories that
take two or more affixes in a single paradigm cell (Turkish or Finnish, for
example, and Tzeltal).  This does not mean that Pynini is inherently incapable
of being used with agglutinating languages, but it does mean that the paradigm
library would need further development in order to be used in this situation. 
This issue was confirmed by Richard Sproat (p.c.), who pointed out that there
are other ways to generate agglutinating morphology in Pynini that do not use
the paradigm library.  That said, I found that library to be convenient for
its other capabilities.  Since Pynini is open source software, it should be
possible for an interested user to modify this library to allow its use for
agglutinating morphology.

There are occasional typos, most of which are readily interpretable.  A list
of errata can be found at http://wellformedness.com/fstp-errata.html.

There are a few confusing phrasings (like the discussion of rational relations
vs. regular expression substitutions on page 10), and more rarely,
misstatements.  An example of the latter appears on page 22, where it is
stated that the ASCII character encoding system ''is only sufficient for
English text.''  This is not true; it is perfectly sufficient for a number of
other languages, including Bahasa Indonesia, Cebuano, Hawaiian, Inuktitut (in
the non-syllabary writing system) and Swahili.

The index is tiny; I found myself wondering at times where some function or
parameter had been defined, but the index was no help--so I ended up paging
back through the text, then penciling the term into the index.  It appears
that the book was originally prepared as a PDF with search and clickable
links, which would have made the index less important.  A few chapters are
accessible in this form at
https://books.google.com/books/about/Finite_State_Text_Processing.html?id=fyIwEAAAQBAJ.

As for the software, I have described some issues above, but I will add that I
did not find some of the error messages particularly helpful.  For example, I
wrote some code asking for the paradigm of a particular Tzeltal verb,
following the methodology for Russian nouns (section 6.5.1 in the book).  But
I misspelled the stem (in the sense that my spelling in a test case did not
match that in my dictionary).  The error message was ''Composition failure''
(mentioned above in a different context).  Looking at the source code, I could
track down where this came from, but a more explicit message would have made
this easier, perhaps along the lines of ''The stem ''ko'tan'' is not among the
stems listed for the paradigm 'paraIV'''.  Having said that, I will add that
crafting useful error messages--and pointing to the exact point in the user's
code that will help a user find the error--is a difficult task.

Finally, I offer some comparisons with other finite state tools.  Perhaps the
best known tool is the Xerox Finite State Toolkit (XFST), described in Beesley
and Karttunen (2003).  The book is an excellent description of both finite
state technology in general and the XFST (and LEXC) program in particular,
largely from the viewpoint of a linguist.  While the original version of XFST
did not handle Unicode, a Unicode-capable version was released later. 
However, XFST is proprietary software, does not handle weighted transducers,
and cannot be directly integrated into programming languages.  An open source
version of this FST language called Foma was written by Mans Hulden (2009),
and is available from https://fomafst.github.io under the Apache license. 
Foma can be integrated into C and Python programming language code.

The Stuttgart Finite State Transducer (SFST, Schmid 2005) uses a different
programming language syntax from that of XFST.  It is largely comparable to
XFST in its capabilities, but provides some functionality for weighted
transducers (according to the website,
https://www.cis.lmu.de/~schmid/tools/SFST/, although I was unable to find
documentation) and has a Python binding.  SFST is available as open source
under the GNU GPL.

The Helsinki Finite State Toolkit (HFST) is another open source implementation
under the GPL license, available from https://github.com/hfst/hfst (see also
https://github.com/hfst/hfst/wiki).  It provides both weights and a Python
interface, and implements a number of the above FST tools under the hood. 
Another tool is Kleene (Beesley 2012, downloadable from
http://kleene-lang.org); Kleene is Java-based.

Which tool is best?  The answer to that is beyond the scope of this review,
and would involve questions like speed and memory use, and support for bug
fixes and other improvements.  I will say that for a linguist-friendly
introduction, XFST (and by extension, Foma, which implements more or less the
same programming language) shines--but it does not provide weights, which are
useful in practical applications.  I have not used HFST or Kleene, although
both look promising.  I have used SFST, but the version I worked with did not
support weights, so I cannot comment on the current version.  Where Pynini
shines is its tighter integration into Python, a programming language with a
huge variety of other libraries of relevance to linguistics, both on the
traditional programming side and on the machine learning side.  While several
other finite state tools (notably HFST) also integrate with Python, one major
difference is that these other tools use regular expression languages, whereas
(as mentioned above) Pynini largely provides Python functions as a way to
create a finite state network (see Gorman 2016 for more on the differences). 
Python also provides an easy way to model full word reduplication in Pynini,
which makes it easier to handle languages like Bahasa Indonesia.

Pynini also implements a function pdt_replace(), which is said to allow the
creation of push-down transducers, i.e. recursive transition networks (similar
to context free phrase structure grammars); I do not believe other FST tools
provide such a function.  The book does not describe this capability, and I
did not test it.

Bottom line: you have choices.

REFERENCES

Beesley, Kenneth R. 2012. Kleene, a free and open-source language for
finite-state programming. Pp. 50--54 in Proceedings of the 10th International
Workshop on Finite State Methods and Natural Language Processing. ACL.
https://aclanthology.org/W12-6209/.

Beesley, Kenneth R. & Lauri Karttunen. 2003. Finite State Morphology. CSLI
Studies in Computational Linguistics. Chicago: University of Chicago Press.

Gorman, Kyle. 2016. Pynini: A Python library for weighted finite-state
grammar compilation. Pp. 75--80 in Proceedings of the SIGFSM Workshop on
Statistical NLP and Weighted Automata. ACL.
https://aclanthology.org/W16-2409.pdf.

Hulden, Mans. 2009.  Foma: a finite state compiler and library.  Pp. 29--32 in
Proceedings of the 12th Conference of the European Chapter of the Association
for Computational Linguistics, ACL.  https://aclanthology.org/E09-2008.pdf.

Karttunen, Lauri. 1998. The proper treatment of optimality in computational
phonology. Pp. 1--12 in Proceedings of FSMNLP'98: International Workshop on
Finite-State Methods in Natural Language Processing. Bilkent University,
Ankara, Turkey. https://arxiv.org/abs/cmp-lg/9804002.

Schmid, Helmut. 2005. A programming language for finite state transducers.
In Proceedings of the 5th International Workshop on Finite State Methods in
Natural Language Processing (FSMNLP 2005), Helsinki, Finland.
https://www.cis.uni-muenchen.de/~schmid/papers/SFST-PL.pdf.


ABOUT THE REVIEWER

Michael Maxwell is a Research Scientist at the Applied Research Lab for
Intelligence and Security (ARLIS, formerly known as CASL) at the University of
Maryland. Previously he worked at the Linguistic Data Consortium of the
University of Pennsylvania, and before that he served with SIL as a linguistic
consultant in Ecuador and Colombia. His interests are in phonology,
morphology, and lexicography, particularly in computational implementations
of these disciplines, and in low-density and endangered languages.





------------------------------------------------------------------------------

***************************    LINGUIST List Support    ***************************
 The 2020 Fund Drive is under way! Please visit https://funddrive.linguistlist.org
  to find out how to donate and check how your university, country or discipline
     ranks in the fund drive challenges. Or go directly to the donation site:
                   https://crowdfunding.iu.edu/the-linguist-list

                        Let's make this a short fund drive!
                Please feel free to share the link to our campaign:
                    https://funddrive.linguistlist.org/donate/
 


----------------------------------------------------------
LINGUIST List: Vol-33-496	
----------------------------------------------------------





