29.4696, Review: Computational Linguistics; Text/Corpus Linguistics: Desagulier (2018)

The LINGUIST List linguist at listserv.linguistlist.org
Tue Nov 27 00:59:34 UTC 2018


LINGUIST List: Vol-29-4696. Mon Nov 26 2018. ISSN: 1069 - 4875.

Moderator: linguist at linguistlist.org (Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté)
Homepage: https://linguistlist.org


Editor for this issue: Jeremy Coburn <jecoburn at linguistlist.org>
================================================================


Date: Mon, 26 Nov 2018 19:59:14
From: Gözde Mercan [gozdebahadir at gmail.com]
Subject: Corpus Linguistics and Statistics with R

 


Book announced at http://linguistlist.org/issues/29/29-537.html

AUTHOR: Guillaume  Desagulier
TITLE: Corpus Linguistics and Statistics with R
SUBTITLE: Introduction to Quantitative Methods in Linguistics
SERIES TITLE: Quantitative Methods in the Humanities and Social Sciences
PUBLISHER: Springer
YEAR: 2018

REVIEWER: Gözde (Bahadır) Mercan, University of Oslo

SUMMARY

“Corpus Linguistics and Statistics with R” by Guillaume Desagulier is a book
introducing the principal methods and statistics in corpus linguistics using
the programming language R (R Core Team 2018). The author states that “This is
a book on empirical linguistics from a theoretical linguist’s perspective” (p.
viii). It provides not only clear, hands-on, step-by-step instructions on how
to apply these techniques, but also some theoretical discussion on the scope
of corpus linguistics. 

While its target audience is mainly novices in the fields of programming,
statistics and cognitive linguistics, the book may also be of interest to more
experienced researchers. As stated on the back cover, it is suitable for use
as a textbook in graduate and advanced undergraduate courses as well as
self-study. 

The book is part of Springer’s “Quantitative Methods in the Humanities and
Social Sciences” series. It consists of two parts, 10 chapters and 353 pages. 

In the Preface, Desagulier starts with a personal anecdote about what motivated
him to acquire empirical techniques and how he was inspired by Stefan Th. Gries
(whose work is frequently cited throughout the book, e.g. Gries 2009), and
explains the intended readership of the book. The preface also presents
information regarding the goals of the book and the online supplementary
materials as well as some notes to instructors. 

Chapter 1, “Introduction”, presents the theoretical relevance of
corpus-informed judgments by contrasting the top-down generativist approach to
language (e.g. Chomsky 1957) with the derivative, bottom-up approach of
usage-based theories such as cognitive linguistics (e.g. Langacker 1987). In
this chapter, Desagulier also explains what makes a corpus by
presenting the required criteria and how linguists make use of corpora. He
finishes the chapter with an explanation of the role of a corpus within the
empirical cycle of a linguist’s work.

The nine chapters following the introductory chapter are grouped into two
parts. Part I is entitled: “Methods in Corpus Linguistics”; it includes 5
chapters. Part II, entitled “Statistics for Corpus Linguistics” contains the
remaining four chapters. 

Chapter 2, the first chapter in Part I, is a practical introduction to the R
programming language. It acquaints the reader with the fundamental notions of
R and provides step-by-step instructions starting with downloading and
installing R. It presents basic R concepts like scripts, packages, variables,
assignment, functions and arguments. The chapter also introduces the four main
types of R objects, namely vectors, lists, matrices and data frames.

Chapter 3, entitled “Digital Corpora”, presents the different types of
corpora. In this chapter, Desagulier also outlines the steps involved in
corpus compilation. The chapter contains guidelines for creating one’s own
unannotated corpus, and also introduces the properties of ready-made,
annotated corpora such as markup, part-of-speech (POS-) tagging and semantic
tagging. 

Chapter 4 is entitled “Processing and Manipulating Character Strings”. In this
chapter, the author aims to teach the basic methods for handling text material
with R to lay the basis for applied character string processing. He covers the
relevant R functions and regular expressions.

In Chapter 5, “Applied Character String Processing”, Desagulier makes use of
and combines the R methods he presented in the previous chapter to demonstrate
how to handle text material.  He describes basic corpus linguistics operations
and covers concordances, data frame creation from an annotated corpus and
frequency lists. 
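
For readers unfamiliar with these two operations, the following is a minimal
sketch of a frequency list and a keyword-in-context concordance. It is written
in Python rather than the book's R, is not taken from the book, and uses an
invented toy sentence; the function names are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()

def concordance(tokens, node, width=3):
    """Return one keyword-in-context line per occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Toy text invented for illustration only (not corpus data).
sample = "The blood is the life. His blood ran cold at the sight of blood."
sample_tokens = tokenize(sample)

print(frequency_list(sample_tokens)[:3])
for line in concordance(sample_tokens, "blood"):
    print(line)
```

The `width` parameter sets how many tokens of context appear on each side of
the node word; a real corpus would call for a more careful tokenizer.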

Chapter 6, which is the final chapter of Part I, aims to teach the readers how
to summarize frequency data graphically. There are instructions demonstrating
the construction of plots, barplots, histograms, word clouds, motion charts
and other visual representations to summarize results. 

Part II of the book, “Statistics for Corpus Linguistics”, opens with a short
introductory section, emphasizing the relevance and importance of statistics
for contemporary linguistics despite some ongoing misconceptions. 

The first chapter in this part, Chapter 7, consists of a concise introduction
to descriptive statistics. It presents key concepts of descriptive statistics,
namely measures of central tendency and dispersion. This chapter serves as the
basis for the following one. 
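
As a rough illustration of what measures of central tendency and dispersion
involve, here is a short sketch in Python (not the book's R, and not the
author's code), using made-up word lengths as data.

```python
import statistics

# Hypothetical word lengths (in characters), invented for illustration.
word_lengths = [2, 3, 3, 4, 5, 7, 11]

# Measures of central tendency
mean_length = statistics.mean(word_lengths)
median_length = statistics.median(word_lengths)
mode_length = statistics.mode(word_lengths)

# Measures of dispersion (sample variance and standard deviation)
value_range = max(word_lengths) - min(word_lengths)
variance = statistics.variance(word_lengths)
std_dev = statistics.stdev(word_lengths)

print(mean_length, median_length, mode_length)
print(value_range, round(variance, 2), round(std_dev, 2))
```

The single long word pulls the mean above the median, which is why both
measures are usually reported together.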

Chapter 8 is entitled “Notions of Statistical Testing”. As this title
indicates, Desagulier presents some basic concepts of statistical thinking and
inferential statistics. He starts with probabilities, and then explains the
key notions of populations, samples, individuals; random variables, dependent
and independent variables. Next, he covers hypothesis testing and probability
distributions. He concludes the chapter with some important statistical tests,
namely the chi-square (χ²) test, Fisher’s exact test of independence and
correlation.
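
To make the logic of the chi-square test of independence concrete, the sketch
below computes the Pearson statistic for a 2x2 contingency table from observed
and expected frequencies. It is in Python rather than the book's R, omits any
continuity correction, and the counts are hypothetical, not data from the book.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square for a 2x2 table (no continuity correction).

    Returns the statistic and its p-value for one degree of freedom.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: rows = word A / word B, columns = context 1 / context 2
counts = [[30, 10], [20, 40]]
chi2, p_value = chi_square_2x2(counts)
print(round(chi2, 3), p_value)
```

The closed-form p-value holds only for one degree of freedom, that is, for
2x2 tables; larger tables need the general chi-square distribution.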

Chapter 9, “Association and Productivity Measures”, starts with an
introduction, discussing the role of frequency in the generativist vs.
usage-based traditions and outlining the evolution of the concept of frequency
from the first to the second-generation usage-based linguistics. The chapter
covers co-occurrence phenomena (collocation, colligation, collostruction). It
presents association measures which quantify the attraction or repulsion
between two co-occurring linguistic units, including asymmetric association
measures positing a directional dependency between collocates. The chapter
concludes with a section on lexical richness and productivity, covering issues
such as type-token ratio and vocabulary growth curves. 
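
The two lexical richness notions mentioned here can be sketched briefly. The
following Python snippet (not the book's R code; the function names and the toy
token sequence are assumptions made for illustration) computes a type-token
ratio and the points of a vocabulary growth curve.

```python
def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    if not tokens:
        raise ValueError("cannot compute a ratio for an empty token list")
    return len(set(tokens)) / len(tokens)

def vocabulary_growth(tokens):
    """Number of distinct types seen after each successive token."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve

# Toy token sequence invented for illustration only.
toy_tokens = "the whale and the sea and the ship".split()

print(type_token_ratio(toy_tokens))
print(vocabulary_growth(toy_tokens))
```

Because the ratio falls as a text gets longer, it is only comparable across
samples of similar size, which is one reason growth curves are used alongside
it.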

The last chapter of the book, Chapter 10, is on “Clustering Methods”.
Desagulier presents five clustering techniques: Principal Component Analysis,
t-distributed Stochastic Neighbor Embedding, Correspondence Analysis, Multiple
Correspondence Analysis and Hierarchical Cluster Analysis. He explains the
principles of each analysis and illustrates applications with case studies. 
Finally, the chapter also covers cluster dendrograms and network graphs.  

EVALUATION

“Corpus Linguistics and Statistics with R” is a very well-written and
well-organized introductory book. Its contents are clear and readable. The
level of complexity of the text increases gradually from basic to quite
advanced. Each chapter begins with an abstract and most chapters have an
introductory section. This helps the reader contextualize the contents of the
relevant chapter. Furthermore, most chapters have their separate references in
addition to the full bibliography at the end of the book. The references are
sound and comprehensive. 

One of the main strengths of this book is that it constitutes an elaborate,
step-by-step manual for practical implementations of the contents. It enables
the reader to engage in hands-on applications of the methods presented. The
rich online supplements include data sets and R code, making it possible for
the reader to work interactively. Moreover, the exercises at the end of the
chapters (with the solutions at the very end) offer additional study material.
 

R being already a flexible and versatile tool, Desagulier makes the life of
the reader even easier by providing separate instructions for Windows and Mac
users. When several R packages are available for a particular purpose, he
lists them all and mentions his own preference, explaining his reasons. Also,
he cites relevant websites and recommends references for further reading
throughout the book.

Another asset of the book is the author’s fluent style. In addition to being
articulate in his writing, he makes subtle jokes and references to popular
culture (to Star Trek, for instance) and uses catchy examples such as a
concordance of words based on ‘blood’ in the novel “Dracula” to keep the
reader interested while reading a technically demanding text. Furthermore, he
uses figures efficiently to explain his points. For instance, Figure 6.5 (p.
120) is an excellent example demonstrating the rationale of a word cloud, as
it consists of a word cloud of the novel “Moby Dick”, in which the word
‘whale’ is strongly emphasized. Two other examples of such clever use of
visuals are Figure 10.1 (p. 117), a phylogenetic tree by Darwin to illustrate
a dendrogram and Figure 10.2 (p. 118) with the Eiffel Tower viewed from four
different angles to explain the logic behind visualization in clustering
methods. From time to time, Desagulier also appends interesting information
about how certain tools and methods have been developed (for example, on p.
132, where he mentions the recent history of motion charts).

More importantly, Desagulier uses examples from linguistically relevant
topics, case studies and actual data from his own and others’ previous
studies. For example, he refers to his study on pre-adjectival vs.
pre-determiner uses of the intensifiers ‘quite’ and ‘rather’ in the British
National Corpus (BNC) (Desagulier 2015) both in Chapter 8 (p. 160) to explain
the notion of hypothesis testing and in Chapter 10 (p. 270) as a case study
illustrating Multiple Correspondence Analysis. In his discussion of normal
distribution, Desagulier also uses a data set from a real-life lexical
decision task on the auditory processing of German compounds by Isel, Gunter
and Friederici (2003).

In this book, Desagulier takes on a triple challenge. He aims to introduce the
basics of the R language, statistics and corpus linguistics in one book. He is
successful in this ambitious endeavor, which is the greatest strength of the
book. He also manages to increase the level of complexity of the contents
smoothly across chapters. In addition to the detailed methodological
instructions, Desagulier provides some theoretical background in various
sections of the book. Chapters 1, 2, 3 in Part I and Chapters 7 and 8
in Part II are appropriate for even complete beginners. Chapters 4, 5, 6 and
Chapters 9 and 10 are more advanced, but still accessible. Even though most of
the methodology is presented from the perspective of corpus linguistics, some
or all chapters of the book may also appeal to researchers from other related
fields such as computational linguistics and psycholinguistics.

There are only two minor shortcomings of the book. First, even though a
certain number of typos are expected or probably unavoidable in any text,
there are slightly more typos in “Corpus Linguistics and Statistics with R”
than one would expect in such a meticulously crafted book. Just to give a few
examples: In the first sentence of page 44, ‘is’ should read ‘if’ in “…another
thing is the…”, “The Bank of English” is printed twice in the last sentence of
the third paragraph of section 3.1 on page 51, and “three sentence” in the
third paragraph of section 4.3.2 on page 72 should be plural. There are some
more such typos missed in proofreading, but these can easily be corrected in
future editions or in errata.  

The second minor point of criticism is the absence of a general conclusion
chapter or section. Although the book is written in a textbook format mainly
focusing on methodology, it also contains some theoretical aspects. Therefore,
a closing section to wrap up especially the theoretical discussion could have
helped the reader to put everything in better perspective. In the absence of
such a conclusion, there is a risk that readers might feel left in suspense.  

To conclude, this clearly written, coherent book, with its linguistically
relevant examples, data sets and R code, is an inspiring resource for theoretical
linguists who wish to familiarize themselves with quantitative methods and
statistics. In the present era of big data, this book is a very timely and
valuable contribution to the literature. I strongly recommend Guillaume
Desagulier’s “Corpus Linguistics and Statistics with R” to anyone interested
in learning about R, statistics and the concepts and methods of corpus
linguistics.

REFERENCES

Chomsky, Noam. 1957. Syntactic structures. The Hague: Mouton.

Desagulier, Guillaume. 2015. Forms and meanings of intensification: A
multifactorial comparison of ‘quite’ and ‘rather’. Anglophonia 20.
doi:10.4000/anglophonia.558. http://anglophonia.revues.org/558.

Gries, Stefan Thomas. 2009. Quantitative corpus linguistics with R: A
practical introduction. New York, NY: Routledge.

Isel, Frédéric, Thomas C. Gunter & Angela D. Friederici. 2003.
Prosody-assisted head-driven access to spoken German compounds. Journal of
Experimental Psychology: Learning, Memory, and Cognition 29(2). 277–288.
doi:10.1037/0278-7393.29.2.277.

Langacker, Ronald W. 1987. Foundations of cognitive grammar: Theoretical
prerequisites, Vol. 1. Stanford: Stanford University Press.

R Core Team. 2013. R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria.
http://www.R-project.org/.


ABOUT THE REVIEWER

Gözde Mercan is a psycholinguist with a PhD in Cognitive Science from Middle
East Technical University, Ankara, Turkey. Her research focuses on the
processing and mental representation of language, mainly through the
structural priming paradigm. She has conducted structural priming experiments
on various linguistic forms in Turkish, English and Norwegian with monolingual
and multilingual participants. She is also interested in language acquisition
in children and adults. Currently, she is an (external) affiliate of the
Center for Multilingualism in Society across the Lifespan of the University of
Oslo.




