LINGUIST List: Vol-26-1458. Tue Mar 17 2015. ISSN: 1069-4875.
Subject: 26.1458, Review: Comp Ling; Forensic Ling; General Ling; Text/Corpus Ling: Oakes (2014)
Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry, Sara Couture)
Homepage: http://linguistlist.org
************* LINGUIST List 2015 Fund Drive *************
Please support the LL editors and operation with a donation at:
http://funddrive.linguistlist.org/
Editor for this issue: Sara Couture <sara at linguistlist.org>
================================================================
Date: Tue, 17 Mar 2015 11:22:24
From: Bev Thurber [b.thurber at shimer.edu]
Subject: Literary Detective Work on the Computer
Discuss this message:
http://linguistlist.org/pubs/reviews/get-review.cfm?subid=35977877
Book announced at http://linguistlist.org/issues/25/25-2660.html
AUTHOR: Michael P. Oakes
TITLE: Literary Detective Work on the Computer
SERIES TITLE: Natural Language Processing 12
PUBLISHER: John Benjamins
YEAR: 2014
REVIEWER: Bev Thurber, Shimer College
Review's Editor: Helen Aristar-Dry
SUMMARY
This book provides a concise summary of the ways computational linguistics has
been used to obtain certain types of information about texts. The applications
discussed are authorship identification, plagiarism identification and spam
filtering, Shakespearean authorship, style in religious texts, and
decipherment. Each of these topics is discussed in a chapter of approximately
50 pages. The book begins with a brief preface summarizing the content and
explaining its structure, which brings all of the chapters together under
the heading of computer stylometry.
Chapter 1, “Author identification,” provides a summary of some of the basic
techniques that are applied in later chapters and shows how they have been
used to determine who wrote a text when the author is unknown. The chapter
opens with a short introduction to the problem, then discusses features that
different scholars have based their evaluations on. Two major types of
technique are discussed: inter-textual distances and clustering. Inter-textual
distances are ways of measuring the similarity between two texts based on
their shared features (11). The examples provided focus on comparing the
vocabularies of two texts. The distances discussed include the Euclidean and
chi-squared distances, Kullback-Leibler Divergence, and others. Mathematical
formulae are provided along with explanations. The Euclidean distance is based
on the idea of geographical distance, i.e. the shortest line from one point to
another (12). The chi-squared distance is similar to the Euclidean distance,
but with the addition of weights that reflect the number of times a word
occurs in each text (15). Kullback-Leibler Divergence is an application of the
scientific idea of entropy to comparing texts (18). Clustering techniques
start with textual features and transform them, by means of inter-textual
distances, into diagrams of how sets of more than two texts are related
(30). The section on clustering techniques focuses on factor analysis
techniques, especially principal components analysis, which is used in later
chapters. A principal components analysis begins with a table of data, such as
normalized word frequencies from several texts. Standard techniques from
linear algebra are then applied to compute the principal components,
orthogonal eigenvectors that can be used to produce a graph showing how
closely related the texts are (38-44). The chapter ends with a section
comparing the different methods described and examples of related studies.
These studies do not directly address unknown authorship, but ask
closely related questions, such as how an author's writing style changes over
her or his lifetime.
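To make these methods concrete, here is a minimal sketch in R (the
language the book itself uses for its examples) of two inter-textual
distances and a principal components analysis. This is not Oakes's code:
the texts, words, and frequencies are invented placeholders, and a real
study would use normalized counts from a full corpus.

    # Rows are texts; columns are relative frequencies of function words.
    freqs <- rbind(
      textA = c(the = 0.061, of = 0.034, and = 0.029, to = 0.024),
      textB = c(the = 0.055, of = 0.030, and = 0.033, to = 0.027),
      textC = c(the = 0.070, of = 0.041, and = 0.021, to = 0.019)
    )

    # Euclidean distance: the straight-line distance between two frequency
    # vectors. (The chi-squared distance adds per-word weights.)
    euclidean <- function(p, q) sqrt(sum((p - q)^2))

    # Kullback-Leibler divergence: the information lost when q is used to
    # approximate p, with both renormalized to sum to 1.
    kl <- function(p, q) {
      p <- p / sum(p)
      q <- q / sum(q)
      sum(p * log(p / q))
    }

    euclidean(freqs["textA", ], freqs["textB", ])
    kl(freqs["textA", ], freqs["textB", ])

    # Principal components analysis of the same table; plotting the first
    # two components gives the kind of relatedness graph described above.
    pca <- prcomp(freqs, scale. = TRUE)
    plot(pca$x[, 1], pca$x[, 2], type = "n", xlab = "PC1", ylab = "PC2")
    text(pca$x[, 1], pca$x[, 2], labels = rownames(freqs))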
Chapter 2, “Plagiarism and spam filtering,” includes two main sections, one on
each of these topics. The plagiarism section is approximately twice as long as
the spam filtering section. It begins with a discussion of commercially
available plagiarism-detection software, then goes on to describe the
algorithms behind the software, with applications to student essays and
program code. A variety of ways to measure document similarity are discussed,
including the cosine measure, overlapping n-grams, fingerprinting, language
modeling, and techniques from natural language processing. These techniques
require that a suspicious document be compared to a corpus of similar
documents. When such a corpus is unavailable, intrinsic techniques, which
examine the text's writing style, can be used instead. Oakes describes such
techniques in this section as well. Finally, this section of the chapter
addresses the problems of plagiarism by translation and how to tell which of
two similar texts is the original. The second part of Chapter 2, on spam
filtering, covers a variety of approaches to the problem. These include
content-based, exact matching, and rule-based methods as well as approaches
based on machine learning and some outside of the linguistic realm.
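To illustrate, here is a minimal R sketch of two of these similarity
measures, the cosine measure and overlapping character n-grams. The two
sentences are invented, and real plagiarism detection would of course
compare whole documents against a large collection.

    doc1 <- "the quick brown fox jumps over the lazy dog"
    doc2 <- "the quick brown fox leaps over a sleepy dog"

    # Cosine measure: the cosine of the angle between word-count vectors.
    word_counts <- function(s) table(strsplit(s, " ")[[1]])
    cosine <- function(a, b) {
      words <- union(names(a), names(b))
      va <- as.numeric(a[words]); va[is.na(va)] <- 0
      vb <- as.numeric(b[words]); vb[is.na(vb)] <- 0
      sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
    }
    cosine(word_counts(doc1), word_counts(doc2))

    # Overlapping n-grams: the Jaccard overlap of character 4-gram sets.
    ngrams <- function(s, n = 4) {
      starts <- 1:(nchar(s) - n + 1)
      unique(substring(s, starts, starts + n - 1))
    }
    length(intersect(ngrams(doc1), ngrams(doc2))) /
      length(union(ngrams(doc1), ngrams(doc2)))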
Chapter 3, “Computer studies of Shakespearean authorship,” returns to some of
the questions addressed in Chapter 1 with a specific focus on Shakespeare. The
chapter summarizes questions about which plays were or were not written by
Shakespeare and the answers obtained by computational methods. The plays are
divided into three categories, ''Traditional attributions,'' ''Dubitanda,''
and ''Apocrypha,'' following Elliott and Valenza (1996). The 35 plays in the
first category provide a control set for the analyses that follow, as there is
no reason to doubt that Shakespeare wrote those plays himself. The other two
categories, containing plays in which Shakespeare's involvement may have been
less than full authorship, provide subjects for the analyses described in the
chapter. Computational methods have been used to assess Shakespeare's level
of involvement in writing plays from these sets. The principal components
analysis discussed in Chapter 1 is shown in action in this chapter, and other
methods beyond those presented in Chapter 1 are discussed, including Bayesian
analyses and neural networks.
Chapter 4, “Stylometric analysis of religious texts,” addresses questions
related to those in Chapters 1 and 3 while focusing on the New Testament, the
Book of Mormon, and the Qur'an. The first of these takes up the bulk of the
chapter because less work has been done on the other two. The chapter
describes analyses done to answer questions related to authorship by means of
writing style, with correspondence analysis and cluster analysis as the most
frequently mentioned methods. Some time is spent on a discussion of the
hypothetical source Q for the Gospels of Matthew and Luke. Correspondence
analyses showing how the gospels may be related to each other and to Q are
summarized in detail. Other New Testament topics covered include possible
relationships between all the books of the New Testament derived using the
methods of prediction by partial match and word recurrence interval. The
sections on the Book of Mormon and the Qur'an are summaries of similar studies
done on those books.
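As an illustration of what such a cluster analysis involves, here is a
minimal R sketch with invented frequencies rather than data from the
studies Oakes summarizes:

    # Rows are texts; columns are normalized word frequencies.
    freqs <- rbind(
      text1 = c(0.052, 0.031, 0.018, 0.025),
      text2 = c(0.049, 0.033, 0.017, 0.027),
      text3 = c(0.061, 0.022, 0.030, 0.015)
    )
    # Agglomerative clustering on the Euclidean distances between rows;
    # the resulting dendrogram groups stylistically similar texts together.
    plot(hclust(dist(freqs)))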
Chapter 5, “Computers and decipherment,” summarizes ways computational
techniques have been and could be useful in analyzing unknown writing systems.
The best-known decipherments have made little use of computers, but Oakes
suggests that computers could be useful for ''routine tasks like collating and
counting'' (207). The chapter relates decipherment to cryptography and machine
translation and considers Rongorongo and the Indus Valley seals as case
studies. Some attention is also paid to Linear A, Pictish symbols, and Mayan
glyphs. One question discussed in this chapter is how to tell whether a set of
symbols encodes a particular language. Some statistical properties of
language, such as Zipf's law and Sinkov's test, are explained as pointers
toward an answer to this question. The chapter concludes on the pessimistic
note that decipherment of these unreadable scripts is unlikely, but with the
hope that interesting new computational methods will be developed in the
attempt.
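Zipf's law, at least, is easy to check computationally: in natural
language text, a plot of log word frequency against log frequency rank is
roughly a straight line. Here is a minimal R sketch, where ''sample.txt''
is a placeholder for any tokenized text:

    words <- scan("sample.txt", what = character(), quote = "")
    freq <- sort(table(tolower(words)), decreasing = TRUE)
    plot(log(seq_along(freq)), log(as.numeric(freq)),
         xlab = "log rank", ylab = "log frequency")
    # A symbol sequence that deviates badly from a straight line is less
    # likely to encode a natural language.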
The book ends with a long list of references and a short index.
EVALUATION
This book is the twelfth volume of John Benjamins' Natural Language Processing
series. Edited by Ruslan Mitkov of the University of Wolverhampton, this
series focuses on ''new results in NLP and modern alternative theories and
methodologies'' (back of half-title page). This particular volume fits into
the series framework by providing a very broad, yet concise, summary of what
has been done in recent years. The book focuses on methods for analyzing style
with an eye to determining the author of a given text. It provides a mix of
theory (in the form of computations that can be performed) and application (in
the form of case studies). The first chapter, on authorship attribution, lays
the foundation for the rest of the book. Chapters 2 through 4 are clear
follow-ups to Chapter 1 as they present case studies of questioned authorship.
Chapter 5 treats a topic that is related to these, but not quite the same.
Rather than questions of who wrote a text, this chapter is concerned with
whether a given sequence of symbols is a text and, if so, how one can
determine what it says.
The division of the material into chapters based on applications rather than
on methods makes the book's focus seem to be on what has been done rather than
how it was done. Someone looking to solve a particular problem can
quickly see what techniques have been used for that problem or similar
ones, which makes the book a useful source of ideas to try out. This
organization does result in some repetition of methods: principal
components analysis, for example, is explained in Chapter 1 and then
reappears in Chapters 3 through 5, where it is shown in action through
case studies. The repetition ensures that readers interested in a
particular area of study see this important technique applied there.
According to the back cover, ''[t]his book is written for students and
researchers of general linguistics, computational and corpus linguistics, and
computer forensics.'' Graduate students and other researchers in the early
stages of their careers seem the most likely to benefit from the book's system
of organization, and the level of mathematics presented is consistent with
this. The author assumes that his audience understands basic statistics, but
may not be familiar with other mathematical topics, such as vectors and matrix
algebra. The book provides a concise overview of many mathematical methods and
includes details of the mathematics behind some of them, providing, for
example, a detailed tutorial on matrix arithmetic (pp. 35-38).
In keeping with this audience, the book occasionally presents source code for
the methods discussed in the programming language R in order to show
concretely how the mathematical methods described can be used. These
range from the very simple, such as the calculation of a dot product on
page 43, to more involved examples, such as the code for creating a
sorted frequency list on page 242.
These examples are occasional enough to make it clear that the book is not
intended as a primer on R, but readers may find them helpful as an
introduction to R and a guide to implementation.
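For readers who want a sense of what these snippets look like, here are
two tiny examples in the same spirit. These are my illustrations, not
Oakes's printed code, and ''sample.txt'' is a placeholder file name.

    # Dot product of two vectors:
    x <- c(1, 2, 3)
    y <- c(4, 5, 6)
    sum(x * y)  # 32

    # Sorted word-frequency list from a plain-text file:
    words <- scan("sample.txt", what = character(), quote = "")
    sort(table(tolower(words)), decreasing = TRUE)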
The back cover states that this book ''will inspire future researchers to
study these topics for themselves, and gives sufficient details of the methods
and resources to get them started.'' This is an accurate summary of what seems
to be the primary purpose of the book. As a source of starting points for
research, this book is a great resource that will be helpful to anyone looking
for inspiration.
REFERENCES
Elliott, Ward and Robert Valenza. 1996. And then there were none: Winnowing the
Shakespeare claimants. Computers and the Humanities 30: 191-245. DOI:
10.1007/BF00055107.
ABOUT THE REVIEWER
B. A. Thurber is an Assistant Professor of Humanities and Natural Sciences at
Shimer College in Chicago, IL, who is interested in historical and
computational linguistics and medieval ice skating.
----------------------------------------------------------
LINGUIST List: Vol-26-1458
----------------------------------------------------------
Visit LL's Multitree project for over 1000 trees dynamically generated
from scholarly hypotheses about language relationships:
http://multitree.org/