32.2625, Review: Text/Corpus Linguistics: Egbert, Larsson, Biber (2020)

The LINGUIST List linguist at listserv.linguistlist.org
Wed Aug 11 20:29:59 UTC 2021


LINGUIST List: Vol-32-2625. Wed Aug 11 2021. ISSN: 1069 - 4875.

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Jeremy Coburn, Lauren Perkins
Managing Editor: Becca Morris
Team: Helen Aristar-Dry, Everett Green, Sarah Robinson, Nils Hjortnaes, Joshua Sims, Billy Dickson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Editor for this issue: Jeremy Coburn <jecoburn at linguistlist.org>
================================================================


Date: Wed, 11 Aug 2021 16:29:31
From: Tyler Anderson [tanderso at coloradomesa.edu]
Subject: Doing Linguistics with a Corpus

 

Book announced at http://linguistlist.org/issues/32/32-253.html

AUTHOR: Jesse Egbert
AUTHOR: Tove Larsson
AUTHOR: Douglas Biber
TITLE: Doing Linguistics with a Corpus
SUBTITLE: Methodological Considerations for the Everyday User
PUBLISHER: Cambridge University Press
YEAR: 2020

REVIEWER: Tyler Kimball Anderson, Colorado Mesa University

SUMMARY 

Egbert, Larsson & Biber’s booklet ‘Doing linguistics with a corpus:
Methodological considerations for the everyday user’ discusses ways to
improve investigations in which the main research tool is a corpus. The
authors’ main goal is to put the linguist back into corpus linguistics. To
that end, they discuss how to improve research methods, research design, and
the selection of appropriate research questions. In other words, they want
all practicing corpus linguists, seasoned and novice alike, “to take control
of their research while also employing available resources” (p. 2). They
further argue for qualitative interpretation of quantitative data and for
the adoption of minimally sufficient statistical methods.

To begin their discussion, the authors briefly lay out their goals in Section
1 by comparing corpus linguists to everyday drivers. Neither requires
expertise in engineering to use the tools at their disposal, but Egbert,
Larsson & Biber propose that a basic understanding of how their vehicles
work will help both avoid problems. The authors postulate that
“just understanding the nature and composition of the corpus used for
analysis…can be of tremendous assistance when conducting and interpreting
corpus analyses” (pp. 1-2). Section 2, titled ‘Getting to know your corpus’,
begins by discussing the importance of corpus size, concluding
that given two corpora of equivalent design, the larger one will always be
better because it will provide more word and phrase types that would not be
represented in smaller corpora. However, they recognize that finding two
corpora of equivalent design is unlikely, and thus researchers must also
consider ‘representativeness.’ Here linguists must ensure that the corpus
includes texts that are as representative as possible of the target population
they are interested in studying. Thus, they should always find (or compile) a
large corpus whose texts suit the goals of their study. In all
cases, argue the authors, researchers should read and critically examine any
documentation associated with the corpus (looking for information about the
texts themselves) and analytically evaluate the texts in each corpus.

Transitioning to Section 3, the authors target “how quantitative corpus
analyses relate to tangible linguistic descriptions” (p. 15). Here they delve
into research design and the development of research questions. According to
Egbert, Larsson & Biber, when it comes to research design, researchers who
utilize corpora must decide whether they will be investigating linguistic
tokens or entire texts. Similarly, research questions can center on analyzing
the factors that predict structural variations, or investigating what they
call descriptive linguistics, which they define as describing the linguistic
features of the texts. They also discuss the topic of dispersion, where texts
are analyzed to discover how uniformly a given linguistic feature is
distributed across the corpus. In Section 4, the authors emphasize
the need for linguistically interpretable variables in all corpus linguistic
studies. They likewise stress the need for clear operational
definitions of these linguistic variables.

In Section 5 ‘Software tools and linguistic interpretability’, the authors
discuss some of the pitfalls of commonly used tools in corpus linguistics, and
how researchers can circumvent these pitfalls. Of particular importance is
the idea that results should be tested for precision rather than taken at
face value. They illustrate this by
showing how three different tools used for linguistic annotation (i.e.,
Stanford Dependency Parser, Malt Parser, and Biber Tagger) exhibited
different rates of precision when tagging noun-noun sequences. In
a similar vein, Section 6 focuses on ‘The role of statistical analysis in
linguistic descriptions,’ where the emphasis is placed on using minimally
sufficient statistical methods. The authors warn against overreliance on
statistical paradigms solely for the sake of using a specific statistical
model. They illustrate how many researchers rely on the null hypothesis
paradigm, which is extremely sensitive to sample size: with a large enough
corpus, almost any measurable difference can reach statistical
significance. Regardless of what statistical package is
used, the authors argue that qualitative linguistic analysis is always
required in addition to the quantitative analyses. Indeed, Section 7 centers
on ‘Interpreting quantitative results.’ Here it is stressed that “linguistics
is done by linguists, not by computers” (p. 52). Building on the results of
statistical tests, researchers should continue to provide sound qualitative
analyses, which are facilitated by the abundance of linguistic context found
in corpora. The authors urge linguists to examine closely a subset of the
texts that have been submitted to statistical analysis and to attend
meticulously to the text-external contexts (also discussed in Section 2).
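
The sample-size sensitivity attributed to Section 6 can be made concrete with
a small sketch. It is not drawn from the book: it assumes a Pearson
chi-square test on a 2x2 table as one common null hypothesis test, and the
function name and all counts are invented for illustration. The same tiny
difference in rate (1.00% versus 1.05%) is tested at two corpus sizes.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Critical value of chi-square for df = 1 at alpha = 0.05.
CRITICAL_05 = 3.841

# Hypothetical counts: (feature hits, other words) in corpus A and corpus B.
# Small corpora: 10,000 words each, rates of 1.00% and 1.05%.
small = chi_square_2x2(100, 9_900, 105, 9_895)
# Large corpora: 10,000,000 words each, identical rates.
large = chi_square_2x2(100_000, 9_900_000, 105_000, 9_895_000)

for label, stat in (("small", small), ("large", large)):
    verdict = "significant" if stat > CRITICAL_05 else "not significant"
    print(f"{label} corpora: chi-square = {stat:.2f} ({verdict})")
```

With the proportions held constant, the statistic grows in step with corpus
size, so the same difference moves from non-significant to significant
without becoming any more meaningful linguistically.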

The manuscript concludes with Section 8, wherein Egbert, Larsson & Biber
provide a summary of the main points of the booklet. It is important to
mention that each of the main sections (2-7) contains one or more case
studies that illustrate its principal points. For example, in
Section 2 the case study provides a breakdown of two corpora (i.e., COCA
Academic and BNC Academic) and shows that, although the two appear
comparable on the surface, a deeper dive reveals stark differences in
composition that impact the interpretation of the results.

EVALUATION 

This Element from Egbert, Larsson & Biber is an overall positive addition to
the field of corpus linguistics. Perhaps the greatest contribution of this
work is its proposal to return linguistics to the forefront of the
field. Beginning with the title’s focus on “Doing
linguistics,” the authors show that a corpus is a useful tool to help
linguists analyze texts, and not the final word in the analytical process.
With few exceptions, they exemplify each of their topics skillfully with a
variety of case studies. Perhaps the most important of these come in
Section 7, where they provide three case studies that illustrate the
complexity of qualitatively interpreting quantitative results. For example, in
case study 1 (Section 7.2) the research question deals with what adjectives
collocate with the nouns ‘man’ and ‘woman’ and what differences are seen. They
show how doing a deeper dive beyond the output given by the concordancer
provided answers to questions that went beyond the authors’ initial
intuitions. They had postulated that the adjective ‘American’ collocated more
frequently with ‘woman’ because of the popular song by Lenny Kravitz with that
same title. However, by going beyond the frequency-based results and examining
a subset of examples, the researchers showed that the majority of these
examples dealt with minority groups (e.g., ‘Native American woman’). These
case studies were a strong addition to the tome. 

As with any book, this manuscript has some weaknesses that kept the authors
from reaching several of their goals. First of all, their analogy of getting
to know what is ‘under the hood’, while appropriate, did not bear fruit. At no
point in the manuscript did the present reader feel that he understood more
fully what is ‘under the hood’ when it comes to corpora. For example, in
Section 2 they encourage the compilation of a new corpus for every study
without discussing how to perform such a feat, only providing a reference to
another work. If their target audience is the novice, more information on how
to accomplish this task should be provided here. Similarly, they discuss the
pitfalls of reusing publicly available corpora, but do not discuss the
familiarity and trust that some of the most widely used corpora (e.g., COCA,
BNC, etc.) would generate over a self-generated corpus. Later they discuss
the option of researchers developing their own software programs (p. 33);
arguably, the group of researchers who can carry out such a task is small,
and anyone who can is probably not reading this tome.

At some points terminology was not consistent. In Section 3, for example, the
authors discuss the development of research questions and research designs,
two distinct concepts. However, in exemplifying these concepts they talk about
“one major…research question” (p. 16) followed by a “second major type
of…research design” (p. 17), as if the terms were interchangeable. Even
the use of ‘Section’ was a bit confusing and raised the question of why these
units were called sections and not chapters.

While the booklet was well written overall, there were a few points where the
authors did not make appropriate transitions, especially between sections. For
example, between Section 4 and Section 5 no connections are made between the
topic of software tools and the case study provided for linguistically
interpretable variables. Similarly, definitions of terms were often
missing. For example, Section 4 discusses ‘employing MI scores’ but
fails to indicate what these are or how to calculate them. Perhaps this is
because Mutual Information (MI) “is one of the most popular statistical tests
that corpus linguists use to explore collocations” (Szudarski, 2018, p. 77);
however, the authors should not take for granted that such
information will be readily understood by the inexperienced members of their
target audience. Similarly, case study 2 in Section 7.3 failed to clearly
explain some key terms (i.e., multidimensional (MD) analysis). In fact, they
state that this type of analysis is “a classic example of a complex
statistical technique that can create distance between a researcher and
language data” (p. 57). If that is the case, it raises the question of why it
was included in light of their discussion of ‘minimally sufficient
statistics.’ And if they deem it necessary to include, they must further
explain the topic for those readers who have never seen such an analysis.
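
For readers unfamiliar with MI scores, the following is a minimal sketch of
the usual textbook formulation, not a reproduction of anything in the book
under review: MI is the base-2 logarithm of the ratio between the observed
frequency of a word pair and the frequency expected if the two words were
independent. The function name and all counts are hypothetical, and
individual corpus tools may differ in details such as how the collocation
window is handled.

```python
import math

def mi_score(pair_freq: int, node_freq: int, collocate_freq: int,
             corpus_size: int) -> float:
    """Mutual Information: log2(observed / expected co-occurrence)."""
    if min(pair_freq, node_freq, collocate_freq, corpus_size) <= 0:
        raise ValueError("all counts must be positive")
    # Expected co-occurrence if node and collocate were independent.
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(pair_freq / expected)

# Hypothetical counts, invented for illustration only.
score = mi_score(pair_freq=16, node_freq=64, collocate_freq=64,
                 corpus_size=1_048_576)
print(f"MI = {score:.2f}")
```

A score of 0 means the pair occurs exactly as often as chance would predict;
higher scores indicate a stronger association between the two words.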

In a similar vein, the topic of accuracy level is discussed in Section 5. They
recommend that researchers always carry out such measures of accuracy
(including precision and recall), but fail to explain how such a measure can
be carried out. Again, if the topic is important enough to include in the
book, the authors should illustrate how to carry out such tasks.
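
By way of illustration only, precision and recall can be computed by
comparing a tool's output with a hand-coded sample. The sketch below does not
reproduce the authors' procedure; the function name and the token positions
are invented. Precision is the share of the tool's hits that are correct, and
recall is the share of the true instances that the tool found.

```python
def precision_recall(tool_hits: set, gold_hits: set) -> tuple[float, float]:
    """Return (precision, recall) of tool_hits against hand-coded gold_hits."""
    true_positives = len(tool_hits & gold_hits)
    precision = true_positives / len(tool_hits) if tool_hits else 0.0
    recall = true_positives / len(gold_hits) if gold_hits else 0.0
    return precision, recall

# Hypothetical token positions that a tagger marked as noun-noun sequences.
tagger_output = {1, 2, 3, 4, 5}
# Hypothetical positions a human coder identified in the same sample.
hand_coded = {2, 3, 4, 6}

precision, recall = precision_recall(tagger_output, hand_coded)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this invented sample the tagger is right in three of its five hits
(precision 0.60) and finds three of the four true instances (recall 0.75).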

Also, an apparent oversight was found in the conclusion of the book. Here they
reference a blog titled “Linguistics with a corpus,” but fail to tell
readers where they can find it. But perhaps the most glaring shortfall of
this booklet came in Section 4. The stated goal of this section was to
ensure that all variables used in a corpus study fit the guideline of being
linguistically interpretable. However, their case study, on measures of
collocation, has no apparent connection to this goal.

These shortcomings aside, the booklet provides some great insights into how
to improve research for linguists interested in using corpora as tools for
language analysis. As previously mentioned, the emphasis on inviting linguists
back to their own party is well merited. In a data-driven world, Egbert,
Larsson & Biber’s focus on using just enough statistical analysis to get
answers is also a refreshing addition to the field.

REFERENCES

Szudarski, Paweł (2018). Corpus linguistics for vocabulary: A guide for
research. Routledge.


ABOUT THE REVIEWER

Tyler K. Anderson is Professor of Spanish at Colorado Mesa University, where
he teaches courses in language, linguistics and second language acquisition.
His research interests include language attitudes toward manifestations of
contact linguistics, including the acceptability of lexical borrowing and
code-switching in Spanish and English contact situations. He is currently
researching loanwords and core vocabulary using corpus linguistics.




