25.292, Review: Computational Linguistics; Historical Linguistics: Piotrowski (2012)

linguist at linguistlist.org
Thu Jan 16 19:34:52 UTC 2014


LINGUIST List: Vol-25-292. Thu Jan 16 2014. ISSN: 1069 - 4875.

Moderator: Damir Cavar, Eastern Michigan U <damir at linguistlist.org>

Reviews: 
Monica Macaulay, U of Wisconsin Madison
Rajiv Rao, U of Wisconsin Madison
Joseph Salmons, U of Wisconsin Madison
Mateja Schuck, U of Wisconsin Madison
Anja Wanner, U of Wisconsin Madison
       <reviews at linguistlist.org>

Homepage: http://linguistlist.org

Editor for this issue: Rajiv Rao <rajiv at linguistlist.org>
================================================================  

Date: Thu, 16 Jan 2014 14:34:22
From: Bev Thurber [b.thurber at shimer.edu]
Subject: Natural Language Processing for Historical Texts


Book announced at http://linguistlist.org/issues/24/24-2996.html

AUTHOR: Michael Piotrowski
TITLE: Natural Language Processing for Historical Texts
SERIES TITLE: Synthesis Lectures on Human Language Technologies
PUBLISHER: Morgan & Claypool Publishers
YEAR: 2012

REVIEWER: Bev Thurber, Shimer College

SUMMARY

In this book, Michael Piotrowski summarizes much of the state of the art on
how techniques from natural language processing (NLP) have been applied to
texts written in historical variants of modern languages. This book is a
volume of the ''Synthesis Lectures on Human Language Technologies,'' a series
which claims to ''provide concise, original presentations of important
research and development topics'' (back cover). The intended audience consists
of readers with backgrounds in either NLP or the humanities.

In the book's nine chapters, Piotrowski shows what has been done in this field
and what problems are unique to processing historical texts. The main problem,
which the book repeatedly returns to, is that of spelling variation. Two
complete chapters are devoted to this subject, and other chapters frequently
mention it. Other problems include the absence of living native speakers of
historical languages and the scarcity of comparable texts, which leaves only
small corpora to which the techniques described can be applied.

The book begins with a brief introduction outlining the concepts to be
presented, the scope of the book and its overall structure, and the intended
audience. Rather than giving a definition of what ''historical language''
means in this context, Piotrowski gives examples of features of modern
languages that historical languages lack, which define the challenges of
processing them. These features are standard variants and orthographies,
reasonably-sized corpora, and existing processing tools.

Chapter 2, ''NLP and Digital Humanities,'' provides a broad overview of the
field of study. Its goal is to situate NLP within the digital humanities. The
chapter cites examples of how NLP has been used to solve problems in the
humanities, highlighting the potential of NLP techniques and stressing the
importance of a thorough understanding of both fields. Piotrowski concludes
with the opinion that ''both the humanities and NLP could very much benefit
from increased collaboration'' (p. 10).

In Chapter 3, ''Spelling in Historical Texts,'' Piotrowski begins his
treatment of the major problem in dealing with historical texts:
non-standardized spelling. The chapter begins with an explanation of why this
is a problem, and then describes different types of spelling variation. These
types are difference (i.e. diachronic variation), variance (i.e. synchronic
variation), and uncertainty (i.e. variation introduced by the digitization
process). Data from spell-checkers and taggers is used to illustrate the
problems caused by spelling variation.

At 28 pages, Chapter 4, ''Acquiring Historical Texts,'' is the longest in the
book. It provides an overview of current digitization projects and methods of
digitizing texts, including scanning, optical character recognition, manual
text entry, and computer-aided transcription. Piotrowski discusses the
strengths, weaknesses, and limitations of each method in approximately the
order that they would be used in digitizing a text. Scanning to turn a written
text into a digital image is generally the first step. Optical character
recognition (OCR) is then applied to the image to turn it into electronic
text. While OCR works very well for modern texts and is stressed as the best
system currently available, it is not yet perfect. Much of the
chapter describes the adaptations to current OCR systems needed to make them
work well with historical texts. Some possible adaptations include using
several OCR systems and merging the results, linking an OCR system to a
lexicon, and providing a crowd-sourcing system for humans to correct OCR
output. Manual text entry and computer-assisted transcription are discussed as
alternatives to OCR that may provide better results in some circumstances.
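
To make the idea of merging the results of several OCR systems concrete, here
is a minimal sketch in Python. It is not taken from the book: the function name
merge_ocr_outputs is hypothetical, and the sketch assumes that the outputs are
already aligned token by token, which real systems would first have to
achieve with a separate alignment step. Each token is decided by simple
majority vote.

```python
from collections import Counter


def merge_ocr_outputs(outputs):
    """Merge OCR transcriptions of the same line by per-token majority vote.

    Assumption (for illustration only): every output has the same number of
    whitespace-separated tokens, so tokens can be compared position by
    position.
    """
    if not outputs:
        raise ValueError("at least one OCR output is required")
    token_lists = [text.split() for text in outputs]
    if len({len(tokens) for tokens in token_lists}) != 1:
        raise ValueError("outputs must have the same number of tokens")

    merged = []
    for candidates in zip(*token_lists):
        # most_common(1) returns the most frequent reading; on a tie the
        # reading seen first (the earliest system in the list) wins.
        winner, _count = Counter(candidates).most_common(1)[0]
        merged.append(winner)
    return " ".join(merged)


# Three hypothetical OCR readings of the same line.
readings = ["the olde hovse", "the olde house", "tbe olde house"]
print(merge_ocr_outputs(readings))  # the olde house
```

Voting of this kind only helps when the systems make different errors; if all
of them misread the same glyph, a lexicon or human correction is still needed.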

Chapter 5, ''Text Encoding and Annotation Schemes,'' begins with
freshly-digitized text and describes how to encode and annotate it to make it
useful for researchers. Unicode is discussed for the former purpose, and the
Text Encoding Initiative (TEI) Guidelines for encoding text with Extensible
Markup Language (XML) are discussed for the latter. These two have emerged as
standards for this kind of work, and Piotrowski considers them ''a solid
foundation for encoding and processing many types of historical texts'' (p.
67).
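
As a rough illustration of what these two standards offer, the following
Python sketch uses only the standard library. The Unicode part shows
normalization, which makes visually identical strings compare equal. The TEI
part builds a small fragment with the choice, orig, and reg elements, which
the TEI Guidelines provide for pairing an original spelling with a regularized
one. The helper name tei_choice and the sample words are assumptions made for
this example; the review does not say which TEI elements the book discusses.

```python
import unicodedata
import xml.etree.ElementTree as ET

# "e" followed by a combining acute accent versus the precomposed letter.
decomposed = "e\u0301"
precomposed = "\u00e9"
assert decomposed != precomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed

# The long s (U+017F), common in older printing, maps to a plain "s"
# under compatibility normalization (NFKC).
assert unicodedata.normalize("NFKC", "\u017f") == "s"


def tei_choice(original, regularized):
    """Return a TEI-style <choice> fragment pairing two spellings.

    Hypothetical helper for illustration; a real TEI document would also
    need a header, namespaces, and surrounding text structure.
    """
    choice = ET.Element("choice")
    ET.SubElement(choice, "orig").text = original
    ET.SubElement(choice, "reg").text = regularized
    return ET.tostring(choice, encoding="unicode")


print(tei_choice("vniuersitie", "university"))
# <choice><orig>vniuersitie</orig><reg>university</reg></choice>
```

Keeping both forms in the markup lets a researcher search by the modern
spelling while the original reading remains available.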

In Chapter 6, ''Handling Spelling Variation,'' Piotrowski returns to the issue
of spelling irregularities discussed in Chapter 3. This chapter addresses
specific problems caused by all three types of spelling variation. The focus
is on languages that are still living and how
tools for the modern versions of those languages can be applied to the
historical versions. Piotrowski adds the caveat that ''[t]exts in dead or
extinct languages and scripts certainly pose additional challenges'' without
detailing how to deal with those challenges (p. 69). The major concept
discussed is canonicalizing the spelling in some way. Edit distance is
described as a way of comparing similar strings, which is relevant background
for Piotrowski's treatment of canonicalization methods. He describes both
absolute and relative methods, and then discusses ways to handle OCR errors
and the limits of canonicalization.
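
For readers unfamiliar with edit distance, a minimal Python sketch may help.
It computes the Levenshtein distance (the minimum number of single-character
insertions, deletions, and substitutions) and then uses it to map a historical
spelling to the closest entry in a modern word list. The function names, the
tiny lexicon, and the distance threshold are assumptions made for this
example; this is one simple way edit distance could support canonicalization
and is not necessarily one of the methods Piotrowski describes.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def canonicalize(word, lexicon, max_distance=2):
    """Return the closest lexicon entry, or the word itself if none is near.

    Ties are broken alphabetically, an arbitrary choice for this sketch.
    """
    if not lexicon:
        return word
    best = min(lexicon, key=lambda entry: (edit_distance(word, entry), entry))
    return best if edit_distance(word, best) <= max_distance else word


modern_lexicon = ["have", "love", "give"]  # hypothetical word list
print(edit_distance("kitten", "sitting"))      # 3
print(canonicalize("haue", modern_lexicon))    # have
```

A plain distance threshold treats all character changes alike; it cannot tell
a common historical alternation such as u/v from an unrelated substitution,
which is one reason such an approach has limits.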

Chapter 7, ''NLP Tools for Historical Languages,'' summarizes some
currently-available NLP tools that have been applied to historical texts. This
chapter's point is ''not to give an exhaustive listing of available tools, but
rather to illustrate the variety of approaches that may be used for creating
NLP tools for historical languages'' (p. 85). The techniques discussed are
Part of Speech Tagging (both creating a new tagger for a historical language
and using an existing tagger for an ancestor of its target language),
lemmatization and morphological analysis, and syntactic parsing. Spelling
variation remains a problem in this area, resulting in lower tagger accuracies
than those achieved with modern languages. However, Piotrowski points to
several reasons, including a lack of native speakers, that lead one to expect
lower performance standards for taggers applied to historical languages.

Chapter 8, ''Historical Corpora,'' is a list of corpora that have been
developed for Arabic, Chinese, Dutch, English, French, German, the Nordic
languages, Latin and Ancient Greek, and Portuguese. The author dedicates a few
pages to each language, with brief descriptions of some available
corpora, along with instructions on obtaining them. The corpora represent
different approaches, including different formats and licenses.

The book concludes with Chapter 9, which provides a couple of pages of summary
and looks to the future. Piotrowski sees three challenges in the future of
this field: to deal with variation in historical languages, to develop tools
for marked-up text processing, and to connect NLP and the digital humanities
(p. 118).

A 25-page bibliography concludes the book. A nice feature is that each entry
lists the pages on which the resource is cited, so a reader browsing the
bibliography can find a longer description of texts that seem interesting.

EVALUATION

The publisher's description of the series is quite accurate for this book.
Piotrowski packs a lot of valuable information into its 145 pages. While the
book is not, and does not claim to be, a complete summary of everything that
has been done in the field, it provides a concise explanation of the high
points and numerous avenues for future research. The book ''does not aim to
teach a certain set of core techniques but rather tries to give an overview of
projects and the methods used therein'' (p. 117). As a result, the examples
presented take a range of approaches, but focus on relevant standards, or
emerging standards, when appropriate, as in the case of Unicode and TEI. The
back cover suggests that the topics covered in this book are also relevant to
a variety of modern types of texts, including text messages and online
postings. While this statement seems true, these genres are not referenced in
the text.

The book provides a valuable introduction to the field for humanists who want
an overview of how NLP techniques have been used with historical texts and
what promise NLP holds for the future. Readers from the humanities will need
some background in the digital humanities or computer science to be able to
fully appreciate all that this book provides. Occasionally, a concept from
computer science is introduced without the kind of explanation that a reader
without any background in the field may need (e.g. hashing on pp. 79-81). For
a reader with an NLP background who is interested in working with historical
texts, this book provides a concise and up-to-date summary of the major
problems and methods specific to such texts.

One limitation of the book is that it mainly focuses on languages written
using the Roman alphabet. Since the Roman alphabet presents more than enough
problems, this should be considered an appropriate limit to the book's scope
rather than an omission. Mentions of non-Roman systems include tools for
handling Greek and corpora for Arabic and Chinese as well as additional
challenges associated with other writing systems, such as cuneiform or
Egyptian hieroglyphs.

Overall, the book is exactly what it claims to be: a good overview of recent
progress and problems in applying techniques from NLP to historical texts. It
covers the entire processing cycle, from creating a digital text, to tools to
analyze it, to existing corpora. The varied approaches described in the book
provide many starting points for investigations as well as the necessary
references to help a reader follow up on any of those starting points.


ABOUT THE REVIEWER

B.A. Thurber is an Assistant Professor of Humanities and Natural Sciences in
Chicago who is interested in historical linguistics.







