30.253, Review: Linguistic Theories; Text/Corpus Linguistics: Kopaczyk, Tyrkkö (2018)

The LINGUIST List linguist at listserv.linguistlist.org
Wed Jan 16 19:07:28 UTC 2019


LINGUIST List: Vol-30-253. Wed Jan 16 2019. ISSN: 1069 - 4875.

Subject: 30.253, Review: Linguistic Theories; Text/Corpus Linguistics: Kopaczyk, Tyrkkö (2018)

Moderator: linguist at linguistlist.org (Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté)
Homepage: https://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Jeremy Coburn <jecoburn at linguistlist.org>
================================================================


Date: Wed, 16 Jan 2019 14:07:02
From: Brett Drury [brett.drury at gmail.com]
Subject: Applications of Pattern-driven Methods in Corpus Linguistics

 
Discuss this message:
http://linguistlist.org/pubs/reviews/get-review.cfm?subid=36427159


Book announced at http://linguistlist.org/issues/29/29-1552.html

EDITOR: Joanna  Kopaczyk
EDITOR: Jukka  Tyrkkö
TITLE: Applications of Pattern-driven Methods in Corpus Linguistics
SERIES TITLE: Studies in Corpus Linguistics 82
PUBLISHER: John Benjamins
YEAR: 2018

REVIEWER: Brett Mylo Drury

This book addresses an approach to corpus linguistics grounded in the
frequentist school of statistics, in which the corpus is assumed to contain all
the linguistic phenomena to be discovered, and in which those phenomena are
expected to be discoverable with the same tools in other sufficiently large,
similar corpora. This approach eschews any a priori probabilities or
assumptions about linguistic patterns.

The main outcome of the frequentist approach is the discovery of lexical
bundles. Lexical bundles “are groups of words that occur repeatedly together
within the same register”. Consequently, the majority of the papers in this
book are concerned either with the discovery of lexical bundles in different
genres of corpora, or with techniques for discovering lexical bundles.
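
As a concrete illustration, frequency-driven lexical bundle extraction can be
sketched in a few lines of Python. The thresholds and the whitespace
tokenization below are illustrative choices of mine, not those of any chapter
in the book:

```python
from collections import Counter

def lexical_bundles(texts, n=4, min_freq=2, min_docs=2):
    """Candidate lexical bundles: word n-grams that recur both overall
    (`min_freq`) and across texts (`min_docs`). The thresholds and the
    whitespace tokenization are illustrative; published studies usually
    use normalized cut-offs such as 10 occurrences per million words."""
    totals = Counter()       # total n-gram counts across the collection
    doc_counts = Counter()   # number of texts each n-gram appears in
    for text in texts:
        tokens = text.lower().split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        totals.update(ngrams)
        doc_counts.update(set(ngrams))
    return {g: c for g, c in totals.items()
            if c >= min_freq and doc_counts[g] >= min_docs}
```

Requiring the bundle to recur across several texts is what ties the notion to
a register rather than to one author's idiosyncrasies.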

The book is grouped into four parts: Introduction, Methodological
explorations, Patterns in utilitarian texts, and Patterns in online texts.
With the exception of the Introduction, each part contains two or more papers
loosely grouped around the main theme of the section.

Building a book around a group of research papers is both a strength and a
weakness. The papers address subjects that are likely to be omitted from a
typical academic textbook; however, the transition from one paper to another
is not always smooth, which makes the book a little “choppy.” This is a small
complaint: in general the book is well written and addresses a relatively new
area of research in corpus linguistics.
    
INTRODUCTION

The introduction is a single paper by the book’s editors, Joanna Kopaczyk and
Jukka Tyrkkö. The aim of this paper is to frame the frequentist approach
within the history of corpus linguistics and to introduce the context for the
remainder of the book, and in general the authors achieve this aim. They
justify the frequentist approach by noting the growth of large corpora, which
they dub mega-corpora. They note that the rapid appearance of such corpora
comes at a cost: a lack of curation. This lack of curation and of associated
metadata may cause errors with traditional analysis techniques, but conversely
these corpora lend themselves to frequentist pattern-based techniques.

The chapter draws a distinction between data-driven and corpus-driven methods,
and the editors place pattern-driven methods in the data-driven camp. They
differentiate the two by describing corpus-driven methods as knowledge-based,
making a priori assumptions about language, whereas data-driven methods make
no such assumptions. The editors also provide a historical context for
corpus-driven methods, as well as discussing the methods’ flaws.

The remainder of the chapter provides an overview of the structure of the
book, as well as suggestions for the future of pattern-driven research.

Part I. Methodological explorations

This section of the book contains three chapters that are loosely grouped
around the central aforementioned theme. The chapters are:  “From Lexical
Bundles to Surprisal and Language Models”, “Fine Tuning Lexical Bundles” and
“Lexical Obsolescence and Loss in English”. The common theme across the
chapters is the use of various techniques to identify lexical bundles and, in
the last chapter, the decline of words and lexical bundles in English over
time.

The first chapter, “From Lexical Bundles to Surprisal and Language Models”,
asserts that sequences of words are the fundamental building blocks of
discourse, and that these building blocks give speakers fluency as well as
ease of understanding. The authors assert that these blocks allow native
speakers to understand language with partial or missing information: under
these circumstances native speakers fill in what is missing. The authors
suggest that pattern analysis allows the detection of frequently occurring
information in a corpus that native speakers rely upon to fill in the gaps in
partially constructed phrases.

The chapter describes a number of experiments on the British National Corpus
using various statistical measures to identify lexical bundles, as well as
information-theoretic measures of surprisal. These measures are used to
identify the differences between different types of language use, and
“learner language at different levels”. The experiments with traditional
association measures such as frequency and T-score produced a ranking of
4-grams. It was surprising that the tokenizer used by the authors treated
punctuation and non-word characters as words; in addition, the authors ignored
sentence boundaries in their 4-grams. There is some ad hoc commentary on the
results; the paper would have benefited from more in-depth analysis.

The chapter then introduces the surprisal measure. The authors provide a
comparison of genre with a graphical plot of bigram frequency and surprisal,
and this is repeated with the JLE corpus, but this time with trigrams. The
graphs are accompanied by explanations of the distributions. I think the
chapter would have benefited from statistical analysis, such as the
Wald-Wolfowitz two-sample runs test, of the differences between the
distributions. In addition to the graph analysis, the authors provide a
comparison of collocations of lexical bundles within a syntactic frame. The
analysis takes the form of graphs and tables, and follows the same format of
non-statistical explanation. Finally, the chapter presents some experiments on
detecting surprisal using POS and Tree taggers.
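
For readers unfamiliar with the measure: the surprisal of a word is the
negative log-probability of that word given its context, so improbable
continuations score high. A toy maximum-likelihood bigram version, which is my
illustration rather than the authors' actual language model, might look like
this:

```python
import math
from collections import Counter

def bigram_surprisal(tokens):
    """Surprisal -log2 P(w_i | w_{i-1}) for each token after the first,
    under a maximum-likelihood bigram model estimated from the same
    token stream. Real studies would smooth the counts and keep training
    and test data separate; this is only a toy illustration."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return [(word, -math.log2(bigrams[(prev, word)] / unigrams[prev]))
            for prev, word in zip(tokens, tokens[1:])]
```

In this scheme a word that almost always follows its predecessor carries
little surprisal, which is exactly the property that makes low-surprisal word
sequences candidates for formulaic lexical bundles.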

The chapter is a little overlong, and could have benefited from a more
rigorous statistical analysis of the results rather than a verbose
explanation.

Chapter 3 is a more concise affair than Chapter 2. Its clearly stated aim is
to fine-tune lexical bundles. The author’s main complaint is that lexical
bundles tend to be incomplete, forming part of a larger sequence of words, and
that current methods such as mutual information (MI) are not suitable for
detecting complete lexical bundles. The author suggests that the transitional
probabilities of the components of a lexical bundle, as well as the
transitional probability of the lexical bundle itself, offer a method of
detecting the complete bundle. The author provides some basic experiments on
the types of lexical bundles returned by this method. The author uses another
technique called formulex, but does not explain the method. The author also
suggests some methods to select bundles to represent a corpus; these are
simply sampling strategies, such as stratified sampling. The author applies
these techniques using WordSmith on the DrugDDI corpus.
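
The transitional-probability idea can be illustrated with a short sketch: keep
extending a candidate bundle while the next word is highly predictable from
the bundle so far. This is my simplified reading of the approach, with an
arbitrary threshold and greedy rightward extension; the chapter's actual
procedure and parameters may differ:

```python
from collections import Counter

def extend_bundle(tokens, seed, threshold=0.8):
    """Greedily extend a candidate bundle to the right while the forward
    transitional probability P(next word | bundle) stays at or above
    `threshold`. Threshold and right-only extension are illustrative
    choices, not the chapter's actual parameters."""
    bundle = list(seed)
    while True:
        n = len(bundle)
        # Collect the words that follow each occurrence of the bundle.
        followers = Counter(
            tokens[i + n]
            for i in range(len(tokens) - n)
            if tokens[i:i + n] == bundle)
        if not followers:
            return tuple(bundle)
        word, count = followers.most_common(1)[0]
        if count / sum(followers.values()) < threshold:
            return tuple(bundle)
        bundle.append(word)
```

The intuition matches the author's complaint: a truncated bundle such as “in
the middle” has a near-deterministic continuation (“of”), so it should be
grown until the continuation becomes genuinely uncertain.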

The final chapter in this section is “Lexical Obsolescence and Loss in
English”. The intention of this paper is to track the variation in the
frequency of multi-word expressions (MWEs) over time. The chapter introduces a
measure called the Obsolescence Index (OI), which is simply a moving index of
the adjusted frequency (AF) of an MWE, calculated over intervals of a decade.
The paper would have benefited from a mathematical representation of AF, as
well as of the other frequency terms referred to in the paper. The experiments
conducted by the author computed MWEs using Mutual Information (MI), and then
sorted them by OI. The experiments produced a pivot table, though the table
graphic is a little unclear. The author states that the table was used to
choose some examples for further analysis. The remainder of the chapter
provides some comparative graphical plots of the relative frequency of
unigrams and trigrams over time.
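
Since the paper gives no formula for OI, the following is a purely
hypothetical reconstruction of what a “moving index” of per-decade frequency
could look like, offered only to make the idea concrete; it is not the
chapter's actual measure:

```python
def obsolescence_index(decade_freqs, window=3):
    """A hypothetical reconstruction of a 'moving index' of obsolescence:
    the ratio of an MWE's mean adjusted frequency over the most recent
    `window` decades to its mean over all earlier decades. The review
    does not give the chapter's formula, so this is only one plausible
    reading. Values near 0 indicate an expression falling out of use.
    Requires more decades of data than `window`."""
    earlier = decade_freqs[:-window]
    recent = decade_freqs[-window:]
    base = sum(earlier) / len(earlier)
    if base == 0:
        return float("inf")
    return (sum(recent) / len(recent)) / base
```

Whatever the exact definition, sorting candidate MWEs by such a ratio surfaces
the expressions whose use has collapsed relative to their historical baseline.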

Part II. Patterns in utilitarian texts

This section of the book contains three chapters that are loosely grouped
under the above theme. 

The first chapter proposes the use of Part-of-Speech (POS) grams to extract
phrases from newspaper texts. The approach identifies sequences of POS grams
in order to identify multi-word expressions. The rationale advanced by the
authors is that the approach will detect phrases that would be ignored by
word-level approaches. The sequences of POS tags are extracted, and the
chi-square test is applied to determine whether the sequences are formed by
chance; if the null hypothesis can be rejected, the sequence is accepted. This
trick of using a lower-dimensional space to extract patterns is well known to
NLP practitioners. The remainder of the paper discusses the results, where
example POS patterns from different genres are shown with a sample phrase for
each POS pattern.
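
A minimal version of such a test can be sketched as follows. The review does
not spell out the chapter's null model, so the independence null used here
(expected count as the product of individual tag probabilities) is an
assumption of mine:

```python
from collections import Counter
from math import prod  # Python 3.8+

def pos_sequence_chi_square(tag_stream, seq):
    """Pearson chi-square statistic (1 d.f.) for whether a POS tag
    sequence occurs more often than an independence model predicts,
    with the expected count taken as the product of the individual tag
    probabilities. The chapter's exact null model is not given in the
    review, so this independence null is an assumption."""
    n = len(seq)
    positions = len(tag_stream) - n + 1
    observed = sum(1 for i in range(positions)
                   if tuple(tag_stream[i:i + n]) == tuple(seq))
    tag_freq = Counter(tag_stream)
    expected = positions * prod(tag_freq[t] / len(tag_stream) for t in seq)
    return (observed - expected) ** 2 / expected
```

A statistic above the 5% critical value of 3.84 rejects the hypothesis that
the POS sequence co-occurs by chance, which is the chapter's acceptance
criterion for a candidate pattern.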

The second paper in this part of the book extracts semantic sequences from
legal judgements. A semantic sequence is a sequence of themes distributed
across words and phrases. The author starts the analysis by using an “N that”
pattern, which is used to identify a number of phrases associated with a noun.
The author selects eight nouns, without justification, for further analysis.
The remainder of the chapter discusses each of the selected nouns and the role
that they play in judicial judgements. The conclusion drawn by the author is
that there are five main roles for nouns extracted from judicial judgements by
the “N that” pattern: Evaluation, Cause, Result, Confirmation and Existence.
The author concludes the chapter by discussing the role of semantic sequences
in judicial judgements.

The last chapter in this section concerns the identification of lexical
bundles in English acts of parliament. The chapter follows a similar outline
to the previous chapters: lexical bundles are explained, as are the source
material and the proposed approach.

The initial analysis is similar to that of the previous chapters: the raw
frequency of the lexical bundles is presented and explanations are offered for
the function of some of them. The chapter then focuses on the function of
lexical bundles in legal writing. The author categorizes the bundles and
explains their role in legal writing, claiming that referential lexical
bundles are used heavily because legal writing frequently refers to other
legislation and to legal actors. The remainder of the chapter compares the
grammatical distribution of lexical patterns in the 16th and 17th centuries,
and provides a discussion and conclusion.

Part III. Patterns in online texts

The first chapter in this section concerns lexical bundles in Wikipedia
articles. The structure of this chapter is similar to the others. The data
sources in this case are Wikipedia articles and their more formal equivalents,
which are used as a comparison to the Wikipedia articles. The first analysis
compares lexical bundles of various lengths across genres (economics, medicine
and literary criticism). The chapter then analyses the genres individually
against the aforementioned comparison corpora, and concludes with comments
about the Wikipedia articles. Most importantly, it finds that Wikipedia omits
details about the experimental process and contains “undisputable facts”. The
implicit finding of the author is that Wikipedia falls short of what is
required of a complete academic resource.

The second chapter in this section deals with repetition in marketing texts.
The chapter starts with a definition of marketing texts, as well as a list of
their differing varieties. The authors limit the marketing texts that they
analyze to the legal domain. The chapter provides some brief analysis of the
characteristics of the corpus, then follows the familiar pattern of describing
lexical patterns and the related research. It provides some basic comparative
analysis of the normalized frequency of lexical bundles across the sub-corpora
of the main legal corpus and, in common with a number of previous chapters,
gives examples of the most frequent lexical bundles. Finally, the chapter
concludes that lexical bundles varied the most in the legal writing sub-corpus
and the least in the marketing sub-corpus, which the authors attribute to the
templates used in marketing emails.
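
Normalized frequency, relied on here and in several other chapters, simply
scales raw counts to a common corpus size so that sub-corpora of different
sizes can be compared:

```python
def normalized_frequency(count, corpus_size, per=1_000_000):
    """Scale a raw count to occurrences per `per` words (per million by
    default), so that bundle frequencies are comparable across
    sub-corpora of different sizes."""
    return count * per / corpus_size
```

For example, 30 hits in a 3-million-word sub-corpus and 5 hits in a
500,000-word sub-corpus are the same normalized rate of 10 per million words.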

The penultimate chapter of the book is an analysis of lexical bundles
contained in blogs written in American English. The chapter starts with the
familiar approach of defining lexical bundles, and describing the source
material as well as the methodology of extracting lexical bundles from the
target corpus. The chapter describes the different types of analysis
performed. 

The chapter describes a frequency-based approach in which four-word lexical
bundles that appeared 20 times, and in five separate blogs, were selected for
further functional and grammatical analysis. The functional analysis
classified the lexical bundles into stance expressions, discourse organizers
and referential expressions, while the grammatical analysis compared
grammatical patterns across the lexical bundles. From this analysis the
chapter concluded that blogs rely heavily upon stance and first-person
reference lexical bundles, and that the profile of blogs is more similar to
spoken language than to typical written language.

The final chapter in the book is another chapter about lexical patterns in
blogs. It differs slightly from the majority of chapters in that it describes
the differences between various dialects of English. The chapter then returns
to the familiar pattern of corpus description and lexical pattern extraction.
The initial analysis uses the normalized frequency of 3-grams to compare
English dialect sub-corpora. This approach is extended to a similarity measure
that computes the similarity between English dialects. It should be no
surprise that the authors discovered that countries with similar cultures and
racial makeups have the highest similarity between their English dialects. The
experiment was repeated with hierarchical clustering, and again countries with
similar cultures were grouped together based upon the linguistic similarity of
their dialects of English. The chapter includes a number of visualizations of
the intersections of English dialects, as well as their common n-grams.
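
The review does not name the similarity measure used; one simple and common
choice for comparing n-gram frequency profiles, sketched here as an
illustration rather than as the chapter's method, is cosine similarity:

```python
import math
from collections import Counter

def trigram_profile(text):
    """Counts of word trigrams in a whitespace-tokenized text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:], tokens[2:]))

def cosine_similarity(p, q):
    """Cosine similarity between two trigram count profiles: 1.0 for
    identical profiles, 0.0 for profiles with no trigrams in common."""
    dot = sum(p[g] * q[g] for g in set(p) & set(q))
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0
```

A matrix of such pairwise similarities between dialect sub-corpora is exactly
the kind of input that hierarchical clustering, as used in the chapter, would
then group into dialect families.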

EVALUATION

In general this is an interesting collection of papers which will give the
interested reader an in-depth introduction to pattern-driven corpus
linguistics. The book's approach of using a collection of academic papers is
both a strength and a weakness: the strength is that the subject matter is
addressed in depth, but the weakness is that there is significant redundancy
in the subject matter and the flow of the book is choppy. There is also
something missing from the book. Nearly all of the papers are descriptive.
Lexical bundles describe the nature of the language of a document collection,
and the nature of the language should provide an indication of how language
affects the subject area. This is hinted at in Chapter 6, where the reasoning
language of Supreme Court Justices is analyzed; the natural extension of this
work would be to estimate the bias that this language introduces into judicial
decisions. These criticisms aside, this book is worth reading for anyone
wishing to grasp the state of the art in the field of data-driven corpus
linguistics.


ABOUT THE REVIEWER

Brett Drury is currently the Head of Research at Scicrop, where he is trying
to solve agricultural problems using linguistics. He is particularly
interested in building knowledge models from information in text.




