27.5068, Review: Applied Ling; Text/Corpus Ling: Crawford, Csomay (2015)

The LINGUIST List via LINGUIST linguist at listserv.linguistlist.org
Mon Dec 12 10:43:46 EST 2016

LINGUIST List: Vol-27-5068. Mon Dec 12 2016. ISSN: 1069 - 4875.

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté,
                                   Michael Czerniakowski)
Homepage: http://linguistlist.org

Editor for this issue: Clare Harshey <clare at linguistlist.org>

Date: Mon, 12 Dec 2016 10:43:37
From: Mariana España-Rivera [mespana at staff.uni-marburg.de]
Subject: Doing Corpus Linguistics

Book announced at http://linguistlist.org/issues/26/26-4902.html

AUTHOR: William J. Crawford
AUTHOR: Eniko Csomay
TITLE: Doing Corpus Linguistics
PUBLISHER: Routledge (Taylor and Francis)
YEAR: 2015

REVIEWER: Mariana España-Rivera, Philipps-Universität Marburg

Reviews Editor: Robert A. Cote


The book ''Doing Corpus Linguistics'' (DCL) by William J. Crawford and Eniko
Csomay offers a practical hands-on introduction to the growing field of Corpus
Linguistics (CL). Intended as an introductory guide for university-level
students in Applied Linguistics, it briefly explains how to carry out a
complete corpus-based project from a framework of Register Analysis (Biber and
Conrad 2009; p.16ff). 

Corpus Linguistics is an emerging, interdisciplinary area of study in which
qualitative and quantitative approaches meet. Concerned with understanding how
people use language in various contexts, it uses a «corpus», a collection of
written or spoken texts that are analysed collectively to make statements
about the use of language (p.6). While a prescriptive approach to language
studies traditionally focuses on providing guidelines or rules to dictate
language use, CL aims to provide the linguistic researcher with the
methodologies and digital tools needed for a descriptive approach, which is
useful for ''uncovering'' naturally occurring language patterns and enables
the researcher to evaluate ''how prescriptive rules are followed by language
users'' (p.5).

To help students gain practical skills in register analysis and in handling
online corpora, the authors provide many clearly formulated problems with
practice-oriented solutions. To follow the exercises, one should register with
COCA (Corpus of Contemporary American English).

The book is divided into three Parts and nine chapters. Part 1: ''Introduction
to Doing Corpus Linguistics and Register Analysis''; Part 2: ''Searches in
Available Corpora'', and Part 3: ''Building Your Own Corpus, Analyzing Your
Quantitative Results, and Making Sense of Data''. 
Part 1 contains two chapters that present the basic concepts of
“Linguistics, Corpus Linguistics, and Language Variation” (Chapter 1) and
explain the relevant aspects of “Register Analysis” (Chapter 2). Part 2
comprises two chapters that introduce the essentials of how to search
existing corpora (Chapter 3 “Searching a Corpus”) and provide examples of
“Projects Using Publicly Available Corpora” (Chapter 4). Part 3 is divided
into five chapters. They illustrate how to build a corpus (Chapter 5 “Building
Your Own Corpus”), introduce basic concepts related to statistical analysis
(Chapter 6 “Basic Statistics” and Chapter 7 “Statistical Tests”), and provide
some guidelines for carrying out a research project (Chapter 8 “Doing Corpus
Linguistics”). The closing chapter offers ideas on how to develop a deeper
understanding of the topic (Chapter 9 “A Way Forward”).

The book includes a Preface, Acknowledgments and a list of Tables and Figures.
Bibliographical references are listed at the end of each chapter. The book
closes with an index of names and relevant terms.

Chapter 1, “Linguistics, Corpus Linguistics, and Language Variation”, focuses
on the study of natural language data, i.e., language as it is used in
different contexts and produced for purposes ''other than linguistic
investigation'' (p.8). In this section, the authors introduce some key
concepts of CL: Language variation, Collocation and Frequency. According to
Biber, Conrad, and Reppen (1998; p.8), corpus research is characterised by the
following elements: it is empirical, and it utilises a large and principled
collection of natural texts, which are analysed by means of automatic and
interactive techniques, both quantitative and qualitative. Additionally,
Tognini-Bonelli (2001; p.9) distinguishes between ''corpus-based'' and
''corpus-driven'' research: the former starts from already-identified language
features, the latter from extracting lexical patterns from the corpus. She
also refers to the ''vertical-analysis'' that a software program performs on a
corpus: it locates many examples of a particular language feature at once,
rather than reading the texts 'horizontally', from start to finish, as a human
reader would (p.9-11).

Chapter 2, “Register Analysis”, illustrates the seven ''contextual variables''
identified by Biber and Conrad (2009; p.17). When describing language from the
perspective of a register analysis, we are basically taking into account the
social and cultural environment implicit in real-world usage within a speech
community. These variables are: 1. Participants; 2. Relations between
participants; 3. Channel, i.e., Mode (written, oral) and Medium (permanence of
language); 4. Production Circumstances (process or degree of planning); 5.
Setting (time and place of the communicative event); 6. Communicative
Purposes, which include the degree of factuality and the degree of personal or
subjective attitude expressed about the topic (e.g., to inform, to persuade or
just to interact and share thoughts, ideas or feelings); 7. Topic, a ''broad
situational variable'' (p.20) that can also have an impact on linguistic
characteristics and should not be confused with ''communicative purpose''
(p.17-19).

Chapter 3, “Searching a Corpus”, covers the basics of language units and
search tools needed to start exploring available online corpora. The four most
common lexical units that we identify in CL research are: 

1. Keyword in Context (KWIC) searches allow us to look for an individual word
or a group of pre-selected keywords. By default, a specialised software
program will give us the frequency of a word in the corpus, its frequency
across registers, and the corresponding average. The results are usually
displayed as concordance lines with the keyword highlighted, so that we can
analyse the patterns or parts of speech surrounding it.

2. Collocates. Firth (1951; p.40) coined the term «collocation» to refer to
combinations of two words. We often find them in partially or fully fixed
expressions (e.g., «strong tea» rather than «*powerful tea», p.40).

3. N-Grams are sequences of co-occurring words, where the n-value denotes how
many words there are in a unit (p.41). Some Four-Grams are
also called «lexical bundles» (Biber et al., 1999; p.49). On the basis of
their specific function, frequency and dispersion throughout different
registers, they can be studied as a unit (e.g., «in the case of») and their
position in the structure of discourse has recently been studied (Csomay 2013;

4. POS-Tags or ''Part of Speech-Tags''. A corpus can be either tagged for part
of speech or not. If it is tagged, POS-tags allow us to search, independently
of the actual words, for associated or co-occurring grammatical patterns
(p.53).
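
To make two of these search units more concrete, the following is a minimal
Python sketch (not taken from the book) of a KWIC concordance and an n-gram
count. The tokenizer, the function names and the toy sentence are assumptions
made purely for illustration; real concordance programs tokenize far more
carefully.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def kwic(tokens, keyword, window=3):
    """Return one concordance line per hit: (left context, keyword, right context)."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


def ngrams(tokens, n):
    """Count every sequence of n adjacent tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical toy text, invented for illustration only.
sample = (
    "In the case of strong tea, the taste is bitter. "
    "In the case of weak tea, the taste is mild."
)
tokens = tokenize(sample)

# Print the concordance lines with the keyword aligned in the middle.
for left, key, right in kwic(tokens, "tea"):
    print(f"{left:>20} [{key}] {right}")

# Count four-word sequences; recurrent ones are candidates for lexical bundles.
four_grams = ngrams(tokens, 4)
print(four_grams[("in", "the", "case", "of")])
```

Reading the printed concordance lines from top to bottom is a small-scale
version of the ''vertical'' reading described above.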

Chapter 4, “Projects Using Publicly Available Corpora”, introduces some
corpora developed by Mark Davies at Brigham Young University (BYU). Using
these corpora is free of charge; however, registration is mandatory, and some
restrictions may apply (e.g., number of queries). At Corpus.byu.edu we can
access different corpora that share the same graphical interface and, in part,
allow cross-corpus comparison searches.

The BYU project collects corpora of different varieties of English: British
(British National Corpus: BYU-BNC), American (Corpus of Contemporary American
English: COCA), Canadian (Strathy Corpus), and other English varieties around
the world (Global Web-Based English: GloWbE); for diachronic studies we can
access historical texts from the early nineteenth century (Corpus of
Historical American English: COHA) and texts from the early twentieth century
(TIME Magazine Corpus) (p.58-59). 

Through twelve tasks of two different types, Word- and Phrase-Based projects
and Grammar-Based projects, we explore corpora and gain hands-on experience in
conducting corpus research, interpreting data and presenting results.

When summarising findings from different corpora and comparing distributional
patterns of language features across corpora or particular registers, it is
important to note that each corpus has its own key characteristics in terms of
size, number of registers (single/multiple), situational variables, and time
period. To overcome the size differences between corpora, we use normalized
counts as standard (p.59).
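
As a brief illustration of normalization (a sketch, not the book's own
example): a raw count is divided by the size of the corpus and multiplied by a
common basis. The basis of one million words and all figures below are
assumptions chosen for the example.

```python
def normalize(raw_count, corpus_size, basis=1_000_000):
    """Convert a raw frequency into a rate per `basis` words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * basis / corpus_size


# Hypothetical figures: the same word counted in two corpora of different size.
rate_a = normalize(raw_count=1_200, corpus_size=100_000_000)
rate_b = normalize(raw_count=900, corpus_size=50_000_000)

print(f"Corpus A: {rate_a:.1f} per million words")
print(f"Corpus B: {rate_b:.1f} per million words")
```

Although corpus A has the higher raw count in this invented example, the word
is relatively more frequent in corpus B once corpus size is taken into account.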

With regard to Register Analysis, we need to consider the contextual factors,
which are central when analysing features within a particular register,
because situational characteristics can differ substantially even within the
same register. For example, the BYU-BNC takes its spoken data from oral
histories, meetings, lectures, and doctor-patient interactions, which provide
distinctive features of ''interactional types of discourse'', whereas the COCA
examples, taken from television, radio news, and information shows, provide
language features more closely related to ''informational types of
discourse'' (cf. Biber 1995; p.59). Thus, the data obtained will differ in
their situational variables, which can significantly impact the results of
qualitative analyses.

Part 3, “Building Your Own Corpus, Analyzing Your Quantitative Results, and
Making Sense of Data”, introduces smaller, less representative, specialised
corpora specifically designed to address narrower research topics. As a rule,
we assume that from these corpora, we can only draw conclusions for our own
dataset, so we rarely extrapolate our results (p.79).  

Depending on how specific the research question is, we may need to build our
own corpus. A practical guide for doing this is outlined in Chapter 5,
“Building Your Own Corpus”. The first step is to clarify potential copyright
issues related to the selection, compilation, and storage of digital texts
(p.76). By then, we should have clearly identified the topic of research and
framed it within a (set of) research question(s) or ''hypothesis''. This is
crucial, since the interpretation of the results largely depends on how
clearly and concisely the research question(s) are formulated (p.76-79).

For a variety of reasons, a corpus research project will take ''a good deal of
time commitment'' (p.76). Key aspects to consider when building a corpus are
as follows: LOCATE enough suitable texts that share the selection criteria or
''variables''. For the sake of frequency comparisons, if a corpus includes
SUB-CORPORA, it should be BALANCED, that is, the sub-corpora ought to be of
equal size in terms of number of texts, total word count, and text types
(p.79-80). PREPARE the material by saving it in a plain text format and
removing all ''meta-data'', so that concordance software can easily identify
textual patterns. NAME your files with a coding scheme that allows you to
identify each as being part of a larger group. ADDITIONAL DATA that is not
part of the text analysis but is still relevant to the qualitative analysis
(e.g., number of words) can be added in the header of the text in angle
brackets (< >), so that the program will ignore it (p.81-84).
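
The preparation steps can be sketched in a few lines of Python. This is only
an illustration under stated assumptions: the file-naming scheme
(discipline_year_id), the header fields and the sample text are hypothetical,
and the code does not reproduce how any particular concordance program treats
angle-bracket headers.

```python
import re

# Matches anything enclosed in angle brackets, e.g. <words: 6>.
HEADER_TAG = re.compile(r"<[^<>]*>")


def strip_header_tags(text):
    """Remove angle-bracket material so that only the running text is analysed.

    Note: this removes every <...> span, including any inside the text body.
    """
    return HEADER_TAG.sub("", text).strip()


def parse_file_name(name):
    """Split a coded file name such as 'bio_y1_007.txt' into its parts.

    The scheme discipline_year_id is a hypothetical example of a coding scheme.
    """
    stem = name.rsplit(".", 1)[0]
    discipline, year, text_id = stem.split("_")
    return {"discipline": discipline, "year": year, "id": text_id}


# Hypothetical file content: a header in angle brackets, then the text itself.
raw = "<discipline: biology> <words: 6>\nCells divide when conditions are favourable.\n"

body = strip_header_tags(raw)
word_count = len(body.split())
info = parse_file_name("bio_y1_007.txt")

print(body)
print(word_count, info)
```

A consistent naming scheme of this kind makes it easy to group files into
sub-corpora and to check whether those sub-corpora are balanced.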

Of all the available software programs for CL research, the authors briefly
describe two. One is AntWordProfiler, comprising a Vocabulary Profile, File
Viewer, and Editor Tool, which generates vocabulary statistics and frequency
information; no corpus of texts is preloaded into the program, so you must
upload your own. The other is AntConc, a Concordance Program for doing lexical
and grammatical analysis. Both were developed by Laurence Anthony of Waseda
University, Japan, and are available as freeware, easy-to-install,
multi-platform tools that can be downloaded from
www.laurenceanthony.net/software.html.

In Corpus Studies, quantitative analysis relies on statistical measures. Under
the conditions of experimental design, we can test our hypotheses and obtain
quantitative measures of how frequently a particular language feature occurs
in a particular dataset. The following chapters cover the basics of
descriptive statistics and show how data already collected can be tested with
so-called parametric and non-parametric tests with regard to Variance and
Correlation (p.105ff).

Chapter 6, “Basic Statistics”, introduces the basic terminology used in every
statistical analysis: types of Variables, Functions, Scales and Values, and
explains their meaning with a number of practical examples. Chapter 7,
“Statistical Tests”, explains some statistical methods that have proven useful
and are frequently used in linguistic analysis: One-Way and Two-Way Analysis
of Variance (ANOVA), Chi-Square Tests of Frequency Tables, and (Pearson)
Correlation. Always keeping the target audience in mind, the authors provide
specific examples of how to apply these statistical tests to some real-life
case studies of linguistic data and how to interpret the results.  
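
As an illustration of one of these tests, the following Python sketch computes
a chi-square statistic for a frequency table by hand. It is not the book's own
worked example: the table (texts from two registers that do or do not contain
a given feature) and its counts are invented for the purpose.

```python
def chi_square(table):
    """Pearson chi-square statistic for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom


# Hypothetical counts: rows are two registers, columns are
# texts with the feature / texts without the feature.
observed = [
    [30, 70],
    [50, 50],
]

statistic, dof = chi_square(observed)
print(f"chi-square = {statistic:.3f}, df = {dof}")

# The critical value for 1 degree of freedom at the .05 level is 3.841.
print("significant at .05" if statistic > 3.841 else "not significant at .05")
```

Because the statistic for these invented counts exceeds the critical value,
the difference between the two registers would be treated as statistically
significant at the .05 level.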

Both chapters provide detailed step-by-step instructions that guide the reader
through the various statistical procedures, first by means of manual
calculations, in order to show how any statistical software package performs
them. Finally, we are introduced to the basics of working with SPSS
(Statistical Package for the Social Sciences) and learn how to organise and
enter data as well as present tables of descriptive statistics (p.109-116).
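
In the same spirit of manual calculation, a Pearson correlation can be
computed step by step in Python. This is a sketch under stated assumptions,
not the book's procedure: the two variables and their values are hypothetical,
and squaring r to express the proportion of shared variance is added here as
standard statistical practice.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from deviations from the means."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares of the deviations.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical per-text scores for two linguistic features in five texts.
feature_a = [1, 2, 3, 4, 5]
feature_b = [2, 4, 5, 4, 5]

r = pearson_r(feature_a, feature_b)
print(f"r = {r:.3f}, r squared = {r ** 2:.3f}")
```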

Chapter 8, “Doing Corpus Linguistics”, is dedicated to explaining how to put a
register analysis framework into practice, following either a corpus-based or
corpus-driven approach, to arrive at a functional interpretation of the
results (p.151). This section closes with practical guidance on how to prepare
a written report of research results (p.152-155).

Chapter 9, “A Way Forward”, briefly summarises the key strengths and
weaknesses of corpus-based and corpus-driven studies. With regard to the
latter, the authors emphasise the increasing need for corpus researchers with
''computational and statistical skills to carry out more in-depth analyses''
(p.156). This is evident from the fact that when conducting an in-depth
analysis, we still need to look for different word types (e.g., 'concrete' or
'abstract' nouns; p.156) as tagged corpora usually include only basic
grammatical categories, and the capabilities of existing tagging software
still need to be manually improved.

As far as corpus-driven register studies are concerned, the authors refer to a
''multi dimensional analytical framework'' as a more suitable model for those
striving for a more comprehensive analysis (e.g., to describe language
variation across registers). Developed by Biber (1988; p.157), this methodology
enables us to investigate different types of texts from different registers by
means of measuring co-occurring linguistic features through more
sophisticated, multivariate statistical methodologies. In this way, we can
gain a better insight into language variation across registers (e.g., to
identify dimensions of linguistic variation) or reach comprehensive linguistic
descriptions of linguistic variation in already-identified dimensions (e.g.,
to study variation in the context of specialised language domains, p.158) (cf.
Multidimensional analysis, Loewen & Plonsky, p.119-120).


DCL offers a very practical introduction and is clearly aimed at students who
want to learn how to build their own corpus-based project. Parts 1 and 2 are a
very concise and comprehensive introduction to what CL is, enabling readers to
gain their first experience of searching corpora using a corpus approach to
language variation. Building on this, Part 3 addresses the basic technical and
statistical aspects involved in every corpus research project. In accordance
with the premise of learning-by-doing, DCL presents a concise guide on how to
do it, including how to choose a research topic (p.76ff) and how to formulate
research questions in terms of hypotheses for statistical tests (p.106ff).

Aspiring corpus linguists will need to be familiar with the basics of
statistics and the preferred statistical methodologies of the discipline. A
minor criticism of Chapters 6 and 7 is that the complexity of the subject is
such that it is impossible to offer a comprehensive overview in an
introductory handbook about CL; therefore, the theoretical explanations of
statistical concepts remain superficial, and sometimes the use of symbols can
be a challenge to beginners (e.g., the symbol R2 appears on p.117; however, we
learn nothing about it until it is explained on p.125). One suggestion I have
for improving the text is to add the abbreviations, symbols, and statistical
terms used as a tabular appendix separate from the overall index.

Another criticism concerns the choice of SPSS on the grounds of its
user-friendliness. It is almost impossible to have it installed on a private
laptop due to its costly licence. Leading researchers are already working with
free software, particularly R, and free software should not be avoided just
because it requires a certain level of programming skills.

Linguistics as a science is currently utilising quantitative methodologies,
which are enabling linguistics to develop as a discipline, bringing it in line
with sociology and psychology. Surely this is in part due to CL research. This
does not mean that an introspective view of language will lose its validity,
for ''introspection is irreplaceable in the descriptive documentation of
language'' (Janda, 2013:6; Leech, 2011). As the authors stress, beyond these
examples corpora can have many different applications and corpus techniques
are currently applied in addressing a wide range of subdomains in applied
linguistics, including sociolinguistics, second language acquisition,
psycholinguistics, and translation studies (cf. Corpus, Loewen & Plonsky,

Unfortunately, there is still too little information about the positive impact
that working with CL can have, and it is not even offered as a subject in the
linguistics curricula of (German) universities. This applies in particular to
the Romance language departments. However, learning about its practical
applications, up to the point where corpus-based methodologies can be directly
or indirectly utilised, can only be beneficial to students' professional
prospects. Especially in the era of Big Data and its increasing complexity, CL
offers the tools that will become indispensable to a solid linguistic
education: there is no escaping this fact. In this regard, DCL offers a very
valuable and inspiring point of departure.


Anthony, Laurence. 2014. AntConc (Version 3.4.3m) [Computer Software]. Tokyo,
Japan: Waseda University. Available from http://www.antlab.sci.waseda.ac.jp/  

Anthony, Laurence. 2014. AntWordProfiler (Version [Computer
Software]. Tokyo, Japan: Waseda University. Available from

Davies, Mark. 2008-. The Corpus of Contemporary American English: 520 million
words, 1990-present. Available online at http://corpus.byu.edu/coca/. 

Gries, Stefan Thomas. 2008. Statistik für Sprachwissenschaftler. Göttingen:
Vandenhoeck & Ruprecht.

Janda, Laura A. 2013. Quantitative methods in Cognitive Linguistics: An
introduction. In Cognitive linguistics: The quantitative turn. The essential
Reader. L. A. Janda (ed), 1-32. Germany: De Gruyter. 

Leech, Geoffrey. 2011. Principles and applications of Corpus Linguistics. In
Perspectives on Corpus Linguistics (Studies in Corpus Linguistics 48). V.
Viana, S. Zyngier & G. Barnbrook (eds), 155-170. Amsterdam/Philadelphia: John

Loewen, Shawn & Plonsky, Luke. 2016. An A – Z of Applied Linguistics Research
Methods. UK: Palgrave.

McEnery, Tony & Hardie, Andrew. 2012. Corpus linguistics: method, theory and
practice. Cambridge & New York: Cambridge University Press.


Mariana España-Rivera is a lecturer at the Department of Romance Languages and
Literatures at the University of Marburg (Germany). She earned an M.A. in
Romance Linguistics, Musicology and European & Latin American Art History from
the University of Heidelberg. Her teaching and research interests include
Applied Linguistics, Historical Linguistics and Latin American Cultural
Studies. She is currently building her own Corpus of Academic Written Spanish
by German students for research purposes.

