37.515, Reviews: Corpus Linguistics for Language Learning Research: Pascual Pérez-Paredes; Geraldine Mark; Anne O'Keeffe (2025)
The LINGUIST List
linguist at listserv.linguistlist.org
Sat Feb 7 21:05:02 UTC 2026
LINGUIST List: Vol-37-515. Sat Feb 07 2026. ISSN: 1069 - 4875.
Subject: 37.515, Reviews: Corpus Linguistics for Language Learning Research: Pascual Pérez-Paredes; Geraldine Mark; Anne O'Keeffe (2025)
Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Valeriia Vyshnevetska
Team: Helen Aristar-Dry, Mara Baccaro, Daniel Swanson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org
Homepage: http://linguistlist.org
Editor for this issue: Helen Aristar-Dry <hdry at linguistlist.org>
================================================================
Date: 07-Feb-2026
From: Boshra ElGhazoly [bghazoly at taibahu.edu.sa]
Subject: Pascual Pérez-Paredes; Geraldine Mark; Anne O'Keeffe (2025)
Book announced at https://linguistlist.org/issues/36-2320
Title: Corpus Linguistics for Language Learning Research
Series Title: Research Methods in Applied Linguistics
Publication Year: 2025
Publisher: John Benjamins
http://www.benjamins.com/
Book URL: https://benjamins.com/catalog/rmal.12
Author(s): Pascual Pérez-Paredes; Geraldine Mark; Anne O'Keeffe
Reviewer: Boshra ElGhazoly
Summary
A much-needed and well-formulated textbook, Corpus Linguistics for
Language Learning Research, is published in the series Research
Methods in Applied Linguistics (RMAL) by John Benjamins. The book
provides an underserved readership, i.e., novice researchers and
graduate students majoring in applied linguistics, SLA, and language
pedagogy, with foundational knowledge of how to compile a valid
corpus, step-by-step illustrations of various techniques of corpus
analysis, and applications for language learning and teaching (e.g.,
Data Driven Learning (DDL)), in addition to lists of well-known
corpora and a review of key learner corpus research.
Interestingly, the book starts with a foreword, a testimonial by a
renowned professor, Michael McCarthy, who shares his (personal)
insights on the book’s topic. He describes how Pérez-Paredes, Mark &
O'Keeffe, through the lens of corpus linguistics, skillfully explore
the entangled, dichotomous and multidisciplinary landscape of applied
linguistics, SLA, and language pedagogy, and how learner corpora can
enrich language learning research regardless of the contentious
terminology. The Foreword depicts the scene before this book came out.
Upon searching for a manual for corpus linguistics, newcomers to this
area would find abundant resources on sophisticated, highly
specialized computational linguistics, i.e., peripheral corpus
linguistics tools. For the less complicated (although much needed)
central corpus linguistics, they would come across scattered, perhaps
less formal, tutorials, or at best a well-curated resource that is
part of a handbook. Indeed, this book addresses the need to bring
simple corpus linguistics tools and methodology inclusively to a broad
range of language learning professionals, using real samples of
languages and methodologies in a clear and accessible manner. The book
has 8 chapters and is organized into four parts.
Part I, entitled Corpus Linguistics for Language Learning Research:
Global considerations, has three interconnected chapters. In Chapter
1, Pérez-Paredes, Mark & O'Keeffe provide necessary background on
three key concepts that corpus linguistics explores in the
interrogation of corpora (frequency, contrast, and
representativeness). Chapter 1 starts by historicizing the field of
corpus linguistics and contextualizing (learner) corpus study by
connecting it to contrastive interlanguage analysis. It showcases how
corpus linguistics can be used in contrastive interlanguage analysis
to provide rich probing into L1 speakers’ actual language use and L2
learners’ performance in ways that can reveal the effects of L1
transfer and proficiency level, both of which factor in the process of
language acquisition. Chapter 1 also smoothly introduces the prospects
that corpus linguistics can offer to second language acquisition
research in different designs (cross-sectional, longitudinal and mixed
designs). In Chapter 2, the features of a well-structured textbook
swiftly emerge in the placement of recommended readings (books and
book chapters) at the beginning of the chapter, each with a brief
summary, an organization that continues throughout the book. At some
points, the explanation takes the format of a tutorial, which is most
welcome. In particular, Chapter 2 presents the main functions of
available corpus tools (frequency lists, concordancing and keyword
functions) for words and multi-word units, as well as their usage in
SLA research, with, for example, screenshots illustrating the output
of the AntConc software. With this reader-friendly tutorial style, by
the end of Chapter 2 the reader should know basic information such as
the right file format for corpus software, and should understand how
to normalize frequencies, as well as concordancing, keyness (e.g.,
positive and negative keywords) and the use of n-grams. Importantly,
the reader will know that a concordance cannot do the whole job and
that it is the task of the researcher alone to interpret the datasets
of learner corpora.
Chapter 3 explores different kinds of learner corpora, learner corpus
data and research designs, along with important concerns and different
research outlooks that affect the design and curation of corpora. The
novice researcher should pay attention to these considerations for
learner corpus research design, as well as to the lists of available
corpora. The reader should know, as Chapter 3 advises, that accessing
corpora online or offline might not always be free. They may also need
to make inquiries to check whether the datasets can be narrowed to
suit their research questions according to specific factors (e.g., L1
background, elicitation tasks, level of proficiency) and whether the
data set is coded or uncoded. For example, the longitudinal Corpus and
Repository of Writing (CROW; Staples & Dilger, 2018), which was
collected over the course of two years, has different paid levels of
access. A graduate student or a novice researcher would benefit from
finding out such information as they embark on this journey. It is
also important to understand the basis for benchmarking learners in a
corpus, i.e., the standardized test used for placement into levels.
For example, as mentioned in Chapter 3, the EF-Cambridge Open Language
Database (EFCAMDAT) uses benchmarking to CEFR A1-C2 and allows for
benchmarking to TOEFL, while CROW uses the TOEFL overall score of
80-105, categorized as high intermediate to advanced. Importantly,
Chapter 3 emphasizes that assessment methods could be limited in
analyzing proficiency at higher levels. Reading further about this is
highly recommended.
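
The Chapter 2 notions mentioned above (normalized frequency, n-grams
and positive/negative keyness) can be illustrated with a minimal
Python sketch. It is not taken from the book or from AntConc: the
function names, the whitespace tokenizer and the two toy "corpora" are
assumptions made for illustration, and log-likelihood is used here
only as one commonly reported keyness statistic.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase, split on whitespace."""
    return text.lower().split()


def normalized_frequency(count, corpus_size, per=1_000_000):
    """Relative frequency of an item per `per` tokens."""
    return count / corpus_size * per


def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def keyness_ll(freq_target, size_target, freq_ref, size_ref):
    """Signed log-likelihood keyness.

    Positive values: relatively more frequent in the target corpus.
    Negative values: relatively less frequent in the target corpus.
    """
    total = freq_target + freq_ref
    expected_target = size_target * total / (size_target + size_ref)
    expected_ref = size_ref * total / (size_target + size_ref)
    g2 = 0.0
    if freq_target > 0:
        g2 += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        g2 += freq_ref * math.log(freq_ref / expected_ref)
    g2 *= 2
    overused = freq_target / size_target >= freq_ref / size_ref
    return g2 if overused else -g2


# Hypothetical toy texts, invented for illustration only.
learner = tokenize("i am agree with this idea and i am agree with my teacher")
reference = tokenize("i agree with this idea and my teacher agrees with me")

learner_counts = Counter(learner)
reference_counts = Counter(reference)

for word in ("am", "agree"):
    per_thousand = normalized_frequency(learner_counts[word], len(learner), per=1000)
    score = keyness_ll(learner_counts[word], len(learner),
                       reference_counts[word], len(reference))
    print(word, round(per_thousand, 1), round(score, 2))

# Most frequent three-word sequences in the toy learner text.
print(Counter(ngrams(learner, 3)).most_common(2))
```

With real corpora the same calculations apply, but the numbers remain
descriptive; as the review notes, interpreting them is the task of the
researcher.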
Part II, Units of Analysis in Learner Language Research, has three
chapters that focus on operational issues related to the process of
conducting learner corpus research (LCR). Chapter 4 looks into the
implementation of contrastive interlanguage analysis, i.e., comparing
different interlanguages or comparing learners’ performance with
native speakers’ linguistic output. Specifically, it explores corpus
linguistics analyses of word classes (e.g., nouns, verbs, and adverbs)
in LCR. Here, word classing and part-of-speech (POS) tagging are shown
as a preliminary step prior to the introduction of the methods used
for the study of collocations, colligation and collostructions in
Chapters 5 and 6. Chapter 4 also provides a summary of early learner
corpus research, uses the International Corpus of Learner English to
model POS tagging and the exploration of word classes, and offers
plenty of operational recommendations, e.g., the importance of a
careful selection of POS tags, as well as examples of the consequences
of misusing a POS tagger. Different examples of tag sets are provided.
Novice researchers are also cautioned about using raw learner files
that contain misspellings, as this is the primary cause of tagging
errors. The point stressed in this discussion is that POS tag sets can
affect the analyses. Chapter 4 is highly informative as it provides
detailed and cautionary information.
Chapter 5 marks the shift of interest in LCR toward the study of
collocations. Simple methods for investigating collocations in learner
corpus data are smoothly introduced using Sketch Engine. The reader is
shown how collocation data can be retrieved and analyzed along two
dimensions, i.e., frequency and exclusivity. A strength of this
chapter is the comparison and contrast of different statistical
measures, i.e., t-score, Mutual Information (MI), logDice and Delta P,
which collectively provide background on the affordances of each
measure. On top of that, the research designs of three well-selected
learner corpus studies, i.e., Durrant and Schmitt (2009), Kreyer
(2021), and Wang (2016), are closely inspected, with the operational
steps given in an orderly and simple manner. By the end of this
chapter, the reader will have learned what to watch for in designing a
study to investigate collocations using corpora.
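
To make the contrast between these association measures concrete, the
following Python sketch computes all four from the same counts. It
uses the standard textbook formulas as I understand them, not code
from the book or from Sketch Engine; the function name, the direction
chosen for Delta P (collocate given node) and the example counts are
assumptions made for illustration.

```python
import math


def association_measures(f_xy, f_x, f_y, n):
    """Association scores for a word pair.

    f_xy: co-occurrence frequency of node and collocate (must be > 0)
    f_x:  corpus frequency of the node word
    f_y:  corpus frequency of the collocate
    n:    corpus size in tokens
    """
    # Co-occurrences expected if the two words were independent.
    expected = f_x * f_y / n
    return {
        # t-score: tends to favor frequent combinations.
        "t_score": (f_xy - expected) / math.sqrt(f_xy),
        # MI: tends to favor exclusive, often rare, combinations.
        "mi": math.log2(f_xy / expected),
        # logDice: uses only the three frequencies, not corpus size.
        "log_dice": 14 + math.log2(2 * f_xy / (f_x + f_y)),
        # Delta P (collocate given node): P(y | x) - P(y | not x).
        "delta_p": f_xy / f_x - (f_y - f_xy) / (n - f_x),
    }


# Hypothetical counts: a frequent pair and a rare but exclusive pair.
pairs = {
    "frequent pair": (120, 5000, 800, 1_000_000),
    "exclusive pair": (12, 15, 400, 1_000_000),
}

for label, counts in pairs.items():
    scores = association_measures(*counts)
    print(label, {name: round(value, 2) for name, value in scores.items()})
```

Running the sketch on one's own frequency counts shows how the ranking
of candidate collocations can change with the measure chosen, which is
the kind of design decision the chapter asks readers to weigh.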
Chapter 6, along with the suggested references, provides the reader
with substantial background on colligation, i.e., the recurring
grammatical pattern of a word (e.g., anaphoric and cataphoric
signaling nouns) in learner corpus research. The grammatical features
explored include noun + preposition; verb + -ing vs. to-infinitive;
clausal preferences triggered by a specific word; and the tendency for
some words to show up in a specific position in a clause (aka lexical
priming). Representative articles in this chapter highlight the role
of manual inspection as a preliminary step for analyses; stress the
developmental attributes of colligation in interlanguages; showcase a
relationship between learner input exemplified in textbooks and L2
writing performance; provide operational tips for the identification
of nouns; and introduce collexeme analyses, including distinctive and
covarying ones. Together, Chapters 5 and 6 comprehensively cover
collostruction analyses of collocations and colligations.
Part III, Researching Corpus Applications in Language Learning and
Teaching, comprises two chapters, 7 and 8, which survey corpus
research methods and applications in language teaching and learning,
generally known as Data Driven Learning (DDL), i.e., making use of
corpus tools and methods for language learning. Chapter 7 looks into
research on indirect corpus applications, or hands-off DDL, where
corpus data are invisible to the learners. Chapter 8, on the other
hand, looks into the direct/visible uses of corpora in classrooms,
i.e., “Hands on DDL” (p. 150). Specifically, Chapter 7 showcases key
corpus methods, with reference to previous chapters in the book, that
can be used in creating corpus-informed materials (e.g., dictionaries,
grammars and English learner books). The reader can imagine at this
point how the role of a lexicographer has changed operationally with
the presence of corpus tools. Chapter 7 also depicts how
corpus-informed materials can expand our awareness of productive or
underused forms by register or text type, and of how register differs
in L1 and L2 academic usage. Finally, Chapter 7 reviews interfaces
and applications that can be used for assessing learners' writing,
e.g., Lextutor, Text Inspector and TAALES (Tool for the Automatic
Analysis of Lexical Sophistication).
In Chapter 8, Data Driven Learning (DDL) is brought into the spotlight
from a pedagogical perspective, as it provides a basis for a
lexico-grammatical approach to language teaching. The efficacy of DDL
pedagogy has been studied by comparing it to traditional methods and
by comparing the two types, direct and indirect DDL. The reported
findings point to the efficacy of DDL pedagogy in general. However,
caution in applying indirect DDL to lower-level learners is
recommended. An evaluation and critique of three studies on DDL
pedagogy is provided in terms of corpus size, context, inductive or
deductive teaching method, whether DDL sessions are integrated into or
separated from the research design, measures for specifying
participants’ level of proficiency, and the language under research
(with English being the dominant one, of course). By the end of this
chapter, the reader will know that more research is still needed,
e.g., into the role of the teacher in direct and indirect DDL pedagogy
with respect to complexity, accuracy and fluency in writing in first
and second language contexts, as well as in other skills.
Interestingly, the need for further research on DDL has led to calls
to link it to constructivism (e.g., learner centeredness),
sociocultural theory (e.g., learner agency, self-regulation and
mediation between the teacher and the learners) and SLA (input
flooding, input enhancement, the involvement load hypothesis, the
noticing hypothesis and the usage-based model).
Evaluation
I cannot think of an applied linguistics program at this time that
would not require a core class on corpus linguistics in language
(learning) research regardless of whether the goal of study is L2
pedagogy, language acquisition or even translation. The resources for
corpus linguistics are scattered here and there; this can constitute a
huge distraction, especially for a novice researcher who does not have
a background in computational linguistics. Fortunately, this book
would be a perfect fit for a class targeting graduate students of
applied linguistics and novice researchers of any neighboring
discipline. Not only does it serve its purpose, i.e., providing
introductory knowledge of how corpus linguistics can serve the study
of language learning, but it also accommodates the needs of different
intersecting disciplines (language pedagogy, language acquisition
and/or language learning) in a smooth and well-organized manner. The
book offers readers both a grasp of corpus methods and insights for
the advancement of their own research. Importantly, at each stage and
chapter the book lists numerous informative recommended readings
(e.g., Anthony’s (2022) chapter on what software can do; Bell and
Payant’s (2021) chapter on designing learner corpora; and Sinclair’s
Reading concordances (2003), among many others); together these chart
a syllabus-style road map for newcomers to the field.
Representations of data, i.e., figures, foci, screenshots of panels
illustrating corpus searches (e.g., CQL) and tables, are a real asset.
As the field is progressing rapidly, updates and additions to the
recommended readings or methods might be required from the authors
soon enough. I suggest the inclusion of a chapter on the use of AI in
corpus compilation. AI is transforming the way corpora can be compiled
and used. Specifically, it is changing automated data collection, data
cleaning and normalization, annotation and tagging, feature
extraction, and the creation of specialized corpora; most importantly,
it offers a wide range of data and provides opportunities for its use
to researchers who do not have an NLP background. Also, it would be
helpful in subsequent revised editions to include a list of main
abbreviations at the beginning of the book for novice researchers and
graduate students.
References
Anthony, L. (2022). What can software do? In A. O’Keeffe & M. J.
McCarthy (Eds.), Routledge handbook of corpus linguistics (2nd ed.,
pp. 103-125). Routledge.
Bell, P., & Payant, C. (2021). Designing learner corpora: Collection,
transcription, and annotation. In N. Tracy-Ventura & M. Paquot (Eds.),
The Routledge handbook of second language acquisition and corpora
(pp. 53-67). Routledge.
Durrant, P., & Schmitt, N. (2009). To what extent do native and
non-native writers make use of collocations? International Review of
Applied Linguistics in Language Teaching, 47(2), 157-177.
Kreyer, R. (2021). Collocations in learner English. In P. Pérez-Paredes
& G. Mark (Eds.), Beyond concordance lines: Corpora in language
education (pp. 97-120). John Benjamins.
Sinclair, J. McH. (2003). Reading concordances. Pearson.
Staples, S., & Dilger, B. (2018). Corpus and repository of writing
[Learner corpus articulated with repository]. Available at
https://crow.corporaproject.org
Wang, Y. (2016). The Idiom Principle and L1 influence. A contrastive
learner-corpus study of delexical verb + noun collocations. John
Benjamins.
ABOUT THE REVIEWER
Boshra ElGhazoly holds the position of Assistant Professor of
Linguistics at the Dept. of English Language and Literature, Faculty
of Arts, Menoufia University, Egypt and Taibah University, KSA. She
obtained her Ph.D. (dual degree in Linguistics and Second Language
Studies) and MA (TESOL/Applied Linguistics) from Indiana University,
Bloomington, USA. Her research interests include morphosyntax, SLA,
and translation.