29.3607, Review: Historical Linguistics; Text/Corpus Linguistics: Jenset, McGillivray (2017)

The LINGUIST List linguist at listserv.linguistlist.org
Wed Sep 19 18:48:45 UTC 2018


LINGUIST List: Vol-29-3607. Wed Sep 19 2018. ISSN: 1069 - 4875.

Subject: 29.3607, Review: Historical Linguistics; Text/Corpus Linguistics: Jenset, McGillivray (2017)

Moderator: linguist at linguistlist.org (Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté)
Homepage: https://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Jeremy Coburn <jecoburn at linguistlist.org>
================================================================


Date: Wed, 19 Sep 2018 14:48:21
From: Foinse Ó Caoimh [terek.temirbay at gmail.com]
Subject: Quantitative Historical Linguistics

 


Book announced at http://linguistlist.org/issues/28/28-5178.html

AUTHOR: Gard B. Jenset
AUTHOR: Barbara McGillivray
TITLE: Quantitative Historical Linguistics
SUBTITLE: A Corpus Framework
PUBLISHER: Oxford University Press
YEAR: 2017

REVIEWER: Foinse Ó Caoimh, Maynooth University

SUMMARY

Corpora and quantitative methods have been employed extensively in many
sub-fields of linguistics for decades. Historical linguists, however, seem
more reluctant to embrace the corpus-driven quantitative approach, and fewer
corpora of historical languages are available. Recent introductions to
historical linguistics (Ringe and Eska 2013, Campbell 2013, Hale 2007, etc.)
remain reticent or even hostile towards quantitative methods, despite the
large body of scholarly work on corpus building and natural language
processing for historical languages (Gippert & Gehrke 2015, Piotrowski 2012),
and despite the fruitful results of corpus-driven quantitative studies of
historical languages, notably championed by the two authors of the book under
review (Jenset 2013, McGillivray 2013, etc.). The present book therefore
addresses the community of historical linguists and sets itself three tasks:
first, to justify and advocate the corpus-driven quantitative approach in
historical linguistics; second, to outline the methodological framework of
such an approach; and third, to provide a general account of current
practices and techniques in quantitative studies of historical languages.

Chapter 1 sets out the aim of the book upfront, namely to introduce a
methodological framework for quantitative historical linguistics by discussing
the necessary steps in doing research, without subscribing to specific
techniques or theories in historical linguistics (p. 1). The authors point out
that while there is a high level of awareness of historical linguistics as
data-focused, quantitative corpus methods are still underused and often
misused, largely because the empirical nature of historical linguistics is
less clear (p. 2). They argue for a ‘conceptual change of pace’ (p. 6), whereby
the transparency and objective verifiability required of an empirical
discipline should be conceptualised in a probabilistic rather than a
categorical way (p. 4). The conventional evidence-based approach provides the
categorical judgment that a certain phenomenon exists, but says nothing about
its frequency or its trend of change (pp. 8-10), for which one needs annotated
corpora (pp. 10-12). Even when using annotated corpora, historical linguists
must avoid the pitfalls of raw frequency counts and ‘post hoc analysis’
(pp. 12-15), about which more details are given in Chapter 6. After a short
plea for better documentation and sharing of the research process in order to
enhance reproducibility and collaboration (pp. 15-18), and an even shorter
section advocating pattern-searching in linguistics (pp. 18-19), the authors
turn to an interesting metaphor, that of ‘crossing the chasm’ (Moore 1991),
usually employed to model the acceptance of new products in the market, to
explain possible tactics for helping the majority of historical linguists
‘cross the chasm’ and accept the methodology proposed in this book
(pp. 19-25). This amalgam chapter ends with a showcase study (pp. 25-35) that
surveys articles from six journals focusing on historical linguistics, using
quantitative methods to uncover the links between individual journals,
corpus-based research and the quantitative-qualitative distinction.
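
The pitfall of raw frequency counts mentioned above can be illustrated with a
minimal sketch. The period labels and counts below are invented purely for
illustration and are not taken from the book; the point is only that a raw
count can rise while the relative frequency falls once the differing sizes of
the sub-corpora are taken into account.

```python
# Hypothetical figures, invented for illustration only (not from the book).
periods = {
    "1150-1250": {"hits": 120, "tokens": 200_000},
    "1250-1350": {"hits": 180, "tokens": 600_000},
}


def per_10k(hits, tokens):
    """Normalised frequency: occurrences per 10,000 tokens."""
    return hits / tokens * 10_000


# Compare the raw count with the size-adjusted rate for each period.
rates = {p: per_10k(d["hits"], d["tokens"]) for p, d in periods.items()}
for period, rate in rates.items():
    print(f"{period}: raw={periods[period]['hits']}, per 10k={rate:.1f}")
```

In this invented case the raw count grows from 120 to 180, yet the normalised
rate halves, so a conclusion drawn from raw counts alone would point in the
wrong direction.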

Having prepared the reader for a paradigm shift from categorical models to
probabilistic ones, the authors outline in Chapter 2 the methodological
framework that constitutes the cornerstone of the book. Several basic
assumptions pertaining to historical linguistics are made, such as that the
historical linguistic reality is lost and that qualitative models are still
indispensable in some fields, and key terms such as ‘evidence’ and ‘model’ are
defined (pp. 37-44). A diagram (p. 45) clearly shows the research process
leading from primary sources to ‘models of language that are quantitatively
driven from evidence’ (p. 44). The authors list twelve principles (pp. 44-53)
for conducting historical linguistic research under the proposed framework.
These include general principles for an empirical discipline (e.g. it is
necessary to reach consensus based on empirical argumentation (p. 45), as
opposed to, say, literary criticism) as well as more subject-specific
requirements (e.g. languages are multivariate and should be studied as such
(p. 51)). At this point the authors return to the plea made in Chapter 1 and
propose several ‘best practices’ (pp. 53-58) aimed at increasing
reproducibility and collaboration in the discipline. The last part of the
chapter is dedicated to an elucidation of the concepts of ‘corpus-driven’ and
‘data-driven’ approaches (pp. 58-61), together with an epistemological
examination of the relationship between data and theory (pp. 61-65).

Chapter 3 reviews the early methods, both qualitative and quantitative,
employed in historical linguistics, especially glottochronology. The authors
show that the failure of glottochronology and the advent of structuralism and
generative theory together reduced interest in quantitative methods over the
past decades (pp. 68-71). The rise of modern electronic corpora provides
exciting opportunities. As the authors convincingly show with regression plots
(pp. 74-78), the advance in computing power in recent decades is closely
related to the rapid growth in both the number and the size of corpora of
historical languages. The distinction between qualitative and quantitative
approaches, and the advantages of the latter in certain contexts, are briefly
restated (pp. 78-81), followed by an extensive defence of the use of corpora
and quantitative methods in historical linguistics. Arguments against such
methods from the standpoints of convenience, redundancy, limitation of scope,
principle and the so-called ‘pseudo-science’ scepticism are presented and
refuted (pp. 81-97).

Chapter 4 moves from the question of ‘why do it’ to ‘how to do it’ by
introducing various current methods of annotating historical corpora. Compared
to corpora of contemporary languages, those of historical languages are in
greater need of detailed, interpretative annotation guided by philology (pp.
100-101). Data in the corpus can be structured to facilitate retrieval in
many ways, such as table formats and markup languages (pp. 103-106).
Structured data can then be further annotated in embedded or standalone
formats. Different levels of linguistic annotation are explained, from
pre-processing and tokenization to part-of-speech, morphological, syntactic
and even sociolinguistic annotation (pp. 110-122). Annotation schemes and
standards, the authors argue, should be implemented in the annotation process.
The Universal Dependencies standard and the Text Encoding Initiative (TEI) are
mentioned as promising candidates for standardizing the many existing markup
schemes (pp. 122-125). Many of the methods in this chapter are exemplified by
sample annotated data, and at this point (pp. 125-127) the authors illustrate
the application of automatic Natural Language Processing (NLP) tools to a
Latin corpus, although the efficacy of such tools in this particular case
remains unclear. The chapter ends with some reflections on the limitations and
risks of annotation.
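
As a rough illustration of the ‘standalone’ option mentioned above (as opposed
to embedding tags in the text itself), the following sketch keeps the source
text untouched and stores annotations in a separate layer that points back to
tokens by identifier. It is not the authors' implementation: the sample
sentence, the whitespace tokenizer, the data layout and the hand-assigned tags
are all assumptions made for the sake of the example.

```python
# Sample sentence chosen for illustration; the text itself is never modified.
text = "Gallia est omnis divisa in partes tres"

# Naive whitespace tokenisation, recording character offsets into the text.
tokens = []
pos = 0
for i, form in enumerate(text.split()):
    start = text.index(form, pos)
    end = start + len(form)
    tokens.append({"id": i, "start": start, "end": end})
    pos = end

# Standalone annotation layer: token id -> part-of-speech tag.
# Tags are hand-assigned for illustration, not the output of any tool.
pos_layer = {0: "PROPN", 1: "AUX", 2: "ADJ", 3: "VERB",
             4: "ADP", 5: "NOUN", 6: "NUM"}


def surface(token):
    """Recover a token's surface form from its offsets."""
    return text[token["start"]:token["end"]]


# Query the annotation layer without touching the source text.
nouns = [surface(t) for t in tokens if pos_layer[t["id"]] == "NOUN"]
print(nouns)
```

Because the annotation layer refers to the text only through offsets and
identifiers, further layers (morphological, syntactic and so on) could be
added or revised independently of one another.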

Chapter 5 explores the possibility of (re)using resources, including not only
purpose-built corpora but also dictionaries, official documents and
historical archives. This is a highly original chapter and represents some of
the main breakthroughs the authors have made in recent publications. The
authors lucidly demonstrate, with concrete examples, how historical valency
lexicons automatically derived from treebanks can contribute to our
understanding of languages to a greater extent than conventional dictionaries
(pp. 130-135). Such corpus-driven lexicons can in turn improve the precision
of Optical Character Recognition (OCR) and NLP tools. Historical linguistic
research can benefit from including information on the social features of
texts among the factors that influence language change, while sociolinguists
and historians are able to investigate a large number of source texts with the
help of quantitative corpus methods (pp. 137-140). One way to integrate
further resources into the corpus is to add metadata, preferably in a
separate database linked to the corpus (pp. 140-142). Popular tools for
linking data include the Resource Description Framework (RDF) and the
Hypertext Transfer Protocol (HTTP), and an example of linking a treebank to
the LexInfo ontology via RDF is given in detail on pp. 143-148. Historical and
geographical data can be linked to an annotated corpus in many innovative
ways as well, as exemplified by the Pleiades and the Pelagios projects (pp.
148-151).

The beginning of Chapter 6 reiterates the benefits of corpus and quantitative
methods (pp. 153-157, cf. pp. 8-15, 78-81). Since language is multivariate, as
suggested by Principle 11 (p. 51), its complexity should be tackled with
multivariate techniques (p. 157). The authors choose the problem of the
co-occurrence of Latin spatial preverbs and certain argument structures to
exemplify what these statistical techniques are and how they work (pp.
157-166). A more complex investigation of the rise of existential ‘there’ in
Middle English is then reported, showing readers how to translate linguistic
claims into statistical questions, and how much the factors of word order,
sentence structure, genre and dialect each contributed to the change in the
frequency of ‘there’ over time (pp. 166-186). This analysis offers a valuable
showcase of how to evaluate different statistical techniques and how to test
model fit.

Chapter 7 summarises the core steps of the research process (pp. 189-190) and
presents yet another case study that implements the framework. This study
takes up an old problem, that of the variation between the third person verbal
endings -(e)s and -(e)th in Early Modern English. It tests the various
hypotheses on the reasons for this variation and tries to establish the
relative importance of these reasons (pp. 190-206). The complete process
outlined at the beginning of the chapter is followed and exemplified step by
step in this case study. The chapter concludes with some final remarks.

EVALUATION

This book is the first to systematically construct a methodological
framework for corpus-driven quantitative approaches in historical linguistics,
and it does an excellent job both of demonstrating the necessity and
advantages of the approach and of providing a useful and clear framework for
future research. In addition, it serves as a comprehensive review of the
progress made by corpus-driven quantitative methods so far, and its
bibliography can be used as an up-to-date reference list of major published
corpora of historical languages and corpus-driven quantitative studies of
historical languages. The book benefits from its inclusion of ample sample
data, charts and figures, all of which the authors have made openly available
online for readers to repeat the tests or explore further possibilities.

The editorial standard is high, and I noticed only three typos: 1) ‘We can
distinguish between different types [of] evidence’ (p. 39); 2) ‘since a
particular feature or phenomenon can be absent from a corpus for a number [of]
very different reasons’ (p. 80); 3) the subsection title ‘Pragmatically and
sociolinguistically annotated corpora’ should be in bold type, in line with
other titles on the same level.

The main problem with this monograph is its structure. As I summarised in the
first paragraph of this review, the threefold task of the book is quite clear
and should be (and indeed can be) presented in its logical sequence: first to
justify the approach, second to lay out the framework, and third to present
the actual uses and techniques. However, what one finds is a jumbled
presentation of the three tasks. For example, the definitions of
‘qualitative’ and ‘quantitative’ methods (p. 79) are so basic to the whole
argument of the book that they really should be placed in the first chapter
together with 1.2.1 ‘Empirical methods’ (pp. 3-4); otherwise the ‘new pace’ of
switching from a qualitative to a quantitative approach proposed on p. 6
cannot be precisely understood. Similarly, the distinction between
‘data-driven’ and ‘corpus-driven’ (pp. 58-61) should be moved up at least to
the ‘definitions’ section (pp. 39-44), whereas the in-depth discussion of the
relationship between data and theory (pp. 61-65) might be better placed with
the defence of quantitative methods against objections in Chapter 3. In fact,
the review and defence of quantitative methods (pp. 81-97) should come first
in the book, while the section on ‘problems with certain quantitative
analyses’ (pp. 12-15) would belong more naturally in Chapter 6. I understand
that the authors may wish to foreshadow in the first chapter some of the main
points discussed in the book as a whole, and to remind readers of important
conclusions reached in earlier parts (e.g. pp. 154-156 are no more than a
summary of some points made earlier), but this should be done in a way more
consistent with the logic of presentation.

Notwithstanding these structural considerations, this book is still highly
recommended. Corpus-driven quantitative approaches have huge potential in
historical linguistics, and this treatise on methodology provides a firm
starting point, in an informative and accessible manner, for historical
linguists to get to know and adopt these approaches. I believe scholars will
benefit greatly from reading this book.

REFERENCES:

Campbell, Lyle. 2013. Historical Linguistics: An Introduction (3rd edition).
Edinburgh: Edinburgh University Press.

Gippert, Jost and Ralf Gehrke (eds.). 2015. Historical Corpora. Challenges and
Perspectives (Korpuslinguistik und interdisziplinäre Perspektiven auf Sprache
5). Tübingen: Narr.

Hale, Mark. 2007. Historical Linguistics: Theory and Method. Malden:
Blackwell.

Jenset, Gard B. 2013. Mapping meaning with distributional methods: a
diachronic corpus-based study of existential there. Journal of Historical
Linguistics 3(2). 272-306.

McGillivray, Barbara. 2013. Methods in Latin Computational Linguistics.
Leiden: Brill.

Moore, Geoffrey A. 1991. Crossing the Chasm: Marketing and Selling High-Tech
Products to Mainstream Customers. New York: HarperCollins.

Piotrowski, Michael. 2012. Natural Language Processing for Historical Texts.
Williston: Morgan & Claypool.

Ringe, Don and Joseph F. Eska. 2013. Historical Linguistics: Towards a
Twenty-First Century Reintegration. Cambridge: Cambridge University Press.


ABOUT THE REVIEWER

Fangzhe Qiu is a postdoctoral researcher at the Chronologicon Hibernicum
project hosted at Maynooth University, Ireland. The project aims to map
linguistic variation in the Old Irish language (7th to 10th century AD)
with quantitative and corpus-driven methods. Qiu's main research interests lie
in the Irish language, historical and comparative linguistics, and early Irish
law.




