31.1581, Review: Computational Linguistics; Historical Linguistics: Toner, Han (2019)

The LINGUIST List linguist at listserv.linguistlist.org
Tue May 12 20:38:46 UTC 2020


LINGUIST List: Vol-31-1581. Tue May 12 2020. ISSN: 1069 - 4875.

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Jeremy Coburn
Managing Editor: Becca Morris
Team: Helen Aristar-Dry, Everett Green, Sarah Robinson, Lauren Perkins, Nils Hjortnaes, Yiwen Zhang, Joshua Sims
Homepage: http://linguistlist.org

Editor for this issue: Jeremy Coburn <jecoburn at linguistlist.org>
================================================================


Date: Tue, 12 May 2020 16:38:31
From: Mark Faulkner [mark.faulkner at tcd.ie]
Subject: Language and Chronology

 

Book announced at http://linguistlist.org/issues/30/30-3419.html

EDITOR: Gregory Toner
EDITOR: Xiwu Han
TITLE: Language and Chronology
SUBTITLE: Text Dating by Machine Learning
SERIES TITLE: Language and Computers
PUBLISHER: Brill
YEAR: 2019

REVIEWER: Mark Faulkner, Trinity College Dublin

SUMMARY

Language and Chronology investigates the utility of machine-learning
techniques for dating undated medieval Irish texts. More specifically, it asks
whether a temporal model built from the year-by-year record of events offered
by the Irish Annals (generally thought to have been maintained
contemporaneously over perhaps a thousand years) can date texts from other
genres to 101-year windows.

The book begins with an introduction, which outlines the problem of dating
undated texts in the terms of both philology and machine learning, borrowing
from archaeology the term ‘chronometrics’ to refer to a method that attempts
‘to provide absolute dates’ for texts ‘with a defined margin of error’ (p. 6).

Chapter 1, ‘Dating Texts: Principles and Methods’ is a detailed summary of the
methods that philologists have traditionally used to date the composition of
undated medieval Irish texts. Much of what it has to say applies equally to
medieval texts in other languages, though it is clear that several facets of
the Irish tradition make it a particularly interesting case study for dating
methods, not least the survival of a significant number of texts generally
thought to have been composed at a relatively early date only in manuscript
copies written some centuries later, a tendency for writers to claim
authorship of texts in fact written earlier by others, and a fondness for
deliberate stylistic archaisms. 

Chapter 2, ‘Computational Approaches to Text Dating’, introduces how machine
learning has approached the problem of text dating, describing the different
linguistic features that have been targeted (which include named entities,
keywords and word or character n-grams, as well as metalinguistic or
extralinguistic features such as text length or font) and the various
techniques adopted (including language modelling, regression and
classification). On the basis of previous studies, Toner and Han adopt a
classification-based approach, in which the machine builds a model to assign
texts to given dating windows (e.g. 1400±50 = 1350-1450). They then outline
five new techniques that might improve the dating performance of their
algorithm, the force of which is in essence to allow the dating windows to be
derived from the language of the texts themselves, rather than imposed by the
analysts (so that instead of 1400±50, we might have 1407±17 if that is what
best suits the texts).
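
The window arithmetic can be made concrete with a minimal sketch. It assumes
that a window written as centre ± half-width counts both endpoints, so that
±50 spans 101 calendar years; the function names are hypothetical and are not
taken from the book.

```python
def window_bounds(centre: int, half_width: int) -> tuple[int, int]:
    """Inclusive (start, end) years of the window centre ± half_width."""
    return centre - half_width, centre + half_width


def window_length(start: int, end: int) -> int:
    """Number of calendar years in an inclusive window."""
    return end - start + 1


def in_window(year: int, centre: int, half_width: int) -> bool:
    """True if year falls inside the window centre ± half_width."""
    start, end = window_bounds(centre, half_width)
    return start <= year <= end


# An analyst-imposed window and a narrower one of the kind a
# data-driven split might produce (both taken from the examples above).
fixed = window_bounds(1400, 50)
derived = window_bounds(1407, 17)
print(fixed, window_length(*fixed))
print(derived, window_length(*derived))
```

Under this reading, a text counts as correctly dated when its accepted year
satisfies in_window for the predicted window.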

Chapter 3 trials these new techniques in English and medieval Irish texts. For
English, they use two datasets: one of over 6,000 news snippets published
between 1700 and 2010, chosen because it was used in the Diachronic Text
Evaluation (DTE) task introduced at SemEval-2015 and therefore allowed ready
comparison of their results with those from earlier studies, the other of
almost 2,500 adverts posted on the website Freecycle over a period of 180
days. Targeting character and word n-grams (with n = 1, 2, 3), their algorithm
correctly dated 54% of texts from the DTE dataset within a 21-year window, and
43% of those from Freecycle within a 21-day window. They then turn to the
annals, telling the algorithm to look only at character n-grams. With the
Annals of Innisfallen, it manages to date 74% of segments to the correct
51-year window.
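
For readers unfamiliar with the features involved, the following is a minimal
sketch of character and word n-gram counting with n = 1, 2, 3. It illustrates
the feature type only; it is not the authors' implementation, and the function
names and whitespace tokenisation are assumptions made for the example.

```python
from collections import Counter


def char_ngrams(text: str, n_values=(1, 2, 3)) -> Counter:
    """Count every run of n consecutive characters, spaces included."""
    counts = Counter()
    for n in n_values:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def word_ngrams(text: str, n_values=(1, 2, 3)) -> Counter:
    """Count every run of n consecutive whitespace-separated tokens."""
    tokens = text.split()  # assumption: naive whitespace tokenisation
    counts = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


print(char_ngrams("abab").most_common(3))
print(word_ngrams("to be or not to be")[("to", "be")])
```

A classifier would then be trained on such counts for texts of known date; the
choice of classifier is left open here.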

Chapter 4, ‘Dating Long Documents’, examines whether a dating model derived
from the annals is applicable to other undated Irish texts. Their test corpus
comprises 22 ‘longer medieval Irish texts’, ranging from 263 to 80,020 words.
These were divided into chunks of 20 or more words, and the algorithm asked to
predict a date for each chunk. Asking it to return a 101-year window, its most
frequent prediction for each text coincided with existing philological opinion
32% of the time. Asking it to return a 21-year window, but stipulating this
must fall within the 101-year window already established, led to a slightly
improved performance, coinciding with existing philological opinion half the
time. The model fared best with Middle Irish texts, less well with Old and
Early Modern Irish, but it nonetheless assigned most texts to their correct
period. Since the algorithm tries to date texts chunk by chunk, there is, they
show, some scope to use it to distinguish different strata within a text, such
as where an Early Modern reviser has extended a Middle Irish text.
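
The chunk-and-vote procedure can be sketched as follows. This is a rough
illustration under stated assumptions: the review does not say how chunk
boundaries are drawn or how the 21-year window is constrained to the 101-year
one, so the chunking rule, the containment test and all function names here
are hypothetical.

```python
from collections import Counter


def chunk_words(text: str, min_words: int = 20) -> list[list[str]]:
    """Split a text into consecutive chunks of at least min_words words.

    A short final remainder is folded into the preceding chunk. A text
    shorter than min_words comes back as a single short chunk.
    """
    words = text.split()
    chunks = [words[i:i + min_words] for i in range(0, len(words), min_words)]
    if len(chunks) > 1 and len(chunks[-1]) < min_words:
        tail = chunks.pop()
        chunks[-1].extend(tail)
    return chunks


def most_frequent(windows):
    """Window predicted for the largest number of chunks."""
    return Counter(windows).most_common(1)[0][0]


def most_frequent_inside(fine_windows, coarse):
    """Most frequent fine window lying wholly inside the coarse window."""
    lo, hi = coarse
    inside = [w for w in fine_windows if lo <= w[0] and w[1] <= hi]
    return most_frequent(inside) if inside else None


# Hypothetical per-chunk predictions, written as (start, end) years.
coarse_votes = [(1050, 1150), (1050, 1150), (1150, 1250)]
fine_votes = [(1090, 1110), (1090, 1110), (1300, 1320)]
coarse = most_frequent(coarse_votes)
print(coarse, most_frequent_inside(fine_votes, coarse))
```

Keeping the per-chunk predictions in text order, rather than only the winning
vote, is what would allow different strata within a text to show up.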

A conclusion reviews the success of Toner and Han’s approach, briefly looking
inside the ‘black box’ of the algorithm and considering what linguistic
features it might have been using to date the texts. It is followed by two
appendices, the first a lengthy outline of the dates philologists have usually
assigned to the texts on which the approach was tested in Chapter 4 and the
second a brief outline of some basic concepts from machine learning. A
bibliography closes the book.

EVALUATION

Better datings for undated medieval texts are a major desideratum. As Toner
and Han report (p. 15), the medieval Welsh text, the Four Branches of the
Mabinogi, has been dated anywhere in the two and a half centuries between 1018
and 1275. In English studies, some scholars continue to advocate an origin for
the epic Beowulf in the eleventh century even as a compelling array of
evidence suggests c. 700 is a more reasonable date. Any new approach is
therefore to be welcomed and it is probable that a machine that learns
implicitly from extant texts will notice patterns a human cannot.

That said, Toner and Han’s approach is, by objective standards, a failure.
Their conclusion is that the dating model derived from the annals ‘can be
applied to long narrative texts of various genres’ (p. 116) and ‘could be used
as a tool for assigning texts to a linguistic period’ (p. 137). This would
perhaps have some utility if all knowledge of Irish were lost while a large
corpus, their algorithm and some people (interested in Irish) survived, but
outside such an esoteric doomsday scenario, it is difficult to see
what use such a tool would have. But this is clearly a naïve view that ignores
the incremental nature of scientific work: Toner and Han’s research will, we
have to hope, be built upon by others and, in due course, machines will better
date medieval texts than humans.

What should those machines be told to look at? Toner and Han’s algorithm, as
we have seen, looks at character unigrams, bigrams and trigrams. N-grams
primarily target orthography and may indirectly pick up features of
inflectional morphology (in Old English, one thing the trigram <um > could be
is the dative plural morpheme -um and its frequent occurrence would probably
point to a date before the twelfth century). But orthography is the feature a
modernising scribe can most readily alter when copying a text; it is no
surprise therefore that Toner and Han note their algorithm sometimes generates
predictions which correspond more closely to manuscript date than presumed
composition date. Other linguistic features are less easy to modernise and the
general scholarly consensus (at least in Anglocentric medieval studies) is
that scribes intervened very little with syntax. Focusing on syntax would
require the development of a part of speech tagger and parser for medieval
Irish, but using word n-grams might perhaps superficially pick up some
underlying syntactic patterns, much as character n-grams pick up some
morphological ones.
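
How a single trigram can act as a morphological signal is easy to illustrate.
The sketch below measures how often one character trigram, such as <um > with
its trailing space, occurs relative to all trigrams in a text. The sample
string is invented for the example and is not a quotation from any manuscript;
the function name is hypothetical.

```python
def trigram_rate(text: str, trigram: str) -> float:
    """Share of the text's character trigrams that equal the given trigram."""
    if len(trigram) != 3:
        raise ValueError("trigram must be exactly three characters long")
    total = len(text) - 2  # number of trigram positions in the text
    if total <= 0:
        return 0.0
    hits = sum(1 for i in range(total) if text[i:i + 3] == trigram)
    return hits / total


# Invented sample with two words ending in -um, each followed by a space.
sample = "handum fotum "
print(trigram_rate(sample, "um "))
```

A falling rate for such a trigram across dated texts is the kind of pattern a
character n-gram model could exploit, and also the kind a modernising scribe
could erase.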

The meld of philological and computational methods on show in this book is a
stimulating one. Philology has always been about contextualising particular
linguistic forms, and this is in effect what a machine attempting a
classification task undertakes. It is salutary to someone working in a
philological tradition where it is quite normal to rely still on work
undertaken in the 1880s to encounter the section in the chapter on
computational approaches to text dating entitled ‘Early Research’ and notice
the earliest paper it cites is from 2005. Toner and Han helpfully include in
their introduction instructions on ‘how to read this book’, counselling those
coming from a humanities background to ignore the body of Chapters 2 and 3.
It is certainly true these are very difficult, but while the appendix with its
brief definition of some terms from machine learning is helpful, more could
have been done to make them more accessible. This reader at least would have
preferred to be told what ‘prototype methods’ and a ‘non-parametric
memory-based distances function’ do linguistically rather than learn that the
Wiener process is ‘named in honour of Norbert Wiener’ (pp. 42-3). This is a
serious deficiency in that, as I have argued in the previous paragraph, what
the machine is told directly or indirectly to look at does matter and if
philologists cannot understand what the machine is doing, they cannot advise
on what it should be told to look at. But explaining the methods of one
discipline to those of another is difficult and it is to Toner and Han’s
credit that they have generally explained their methods and results clearly.
The first chapter, ‘Dating Texts’, is a model of clarity, making accessible to
a wider audience an otherwise challenging body of scholarship, much of it
written in Irish, and could easily be set as reading for a class on a
philology course on a masters in Medieval Studies; the appendix on the datings
philologists have assigned to the Irish texts used in the study will be an
invaluable reference point for scholars from a range of different disciplines.

Language and Chronology lays the foundations for the next generation of work
on a crucial philological problem. It is to be hoped many computer scientists
interested in text dating will seize the challenge that is offered by medieval
texts, with their unstandardized orthographies, erratic attestations and want
of tools like parsers that are de rigueur for languages like Present-Day
English on which most chronometric work is focused.


ABOUT THE REVIEWER

Mark Faulkner is Ussher Assistant Professor in Medieval Literature at Trinity
College Dublin. His work on twelfth-century English has led to an interest in
periodisation and text dating, and he has recently received a Provost’s
Project Award for Medieval Big Dating, which will explore quantitative and
methods to develop ‘big data’ techniques to assist in the dating of texts from
the Old and early Middle English periods.




