32.2416, Review: Computational Linguistics; Text/Corpus Linguistics: Landulfo Teixeira Paradela Cunha (2020)

Sun Jul 18 01:29:37 UTC 2021

LINGUIST List: Vol-32-2416. Sat Jul 17 2021. ISSN: 1069 - 4875.

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Jeremy Coburn, Lauren Perkins
Managing Editor: Becca Morris
Team: Helen Aristar-Dry, Everett Green, Sarah Robinson, Nils Hjortnaes, Joshua Sims, Billy Dickson

Homepage: http://linguistlist.org

Editor for this issue: Jeremy Coburn <jecoburn at linguistlist.org>
================================================================

Date: Sat, 17 Jul 2021 21:29:06
From: Nicolás Arellano [nicolas.a.arellano at gmail.com]
Subject: Contributions to the Computational Processing of Diachronic Linguistic Corpora

Book announced at http://linguistlist.org/issues/32/32-683.html

AUTHOR: Evandro Landulfo Teixeira Paradela Cunha
TITLE: Contributions to the Computational Processing of Diachronic Linguistic Corpora
SERIES TITLE: LOT Dissertation Series
PUBLISHER: Netherlands Graduate School of Linguistics / Landelijke Onderzoekschool Taalwetenschap (LOT)
YEAR: 2020

REVIEWER: Nicolás Arellano, Universidad de Buenos Aires

SUMMARY

‘Contributions to the Computational Processing of Diachronic Linguistic
Corpora’ is a book that aims to offer insight into several tasks involving
the use of computational tools in the assessment of diachronic corpora. To
that end, its core consists of three chapters that discuss how to develop new
diachronic (and not necessarily historical) corpora. The book could be
helpful for individual research and, more importantly, may serve as a
starting point for the creation of more general databases.

Chapter 1 serves as the introduction, in which Cunha justifies his research
by exploring the many intersections between linguistics and computer science.
Besides the remarkable increase in studies on formal models of language,
which may represent the most obvious junction between the two fields, the
author also presents several uses of computational methods in already
established subdisciplines, such as sociolinguistics, language preservation,
and dialectology, among others. He does so without losing sight of the
essence of his research: computer-aided studies within the scope of corpus
linguistics, and particularly its diachronic aspect. Cunha distinguishes
diachronic corpora from historical corpora: the former must deal with change
over time and cover a specific span, whether it ends in the past or the
present, while the latter concentrate on the past without taking shifts and
changes into account. While diachronic corpora have already provided
linguists with considerable new opportunities, the author explores particular
tools that help not only to work with corpora, but also to compile and
analyze them. Each of the following chapters focuses on a different aspect of
corpus development.

Chapter 2 deals with corpus building and compilation and presents two
resources. The first is an easy-to-use web scraper for comments from news
portals and websites. The second is a freely available corpus of comments
from a Brazilian news site, compiled with that web scraper. Cunha argues for
the importance of news comment corpora because this type of discourse has
often been neglected due to assumptions about its validity as a source of
information; as a result, most general corpora tend not to include comments.
Yet this discourse genre could shed light on research areas that range from
language change and lexicology to language variation and social aspects of
language. The web scraper (i.e., an automated agent used to extract data from
a particular online source), named Xereta, is open-source and free (Cunha,
Magno & Almeida 2017). It allows the user to extract linguistic data and
metadata from up to a thousand URLs. Thus far, it runs on two major Brazilian
news sites: UOL and Folha de São Paulo. Using the architecture of the web
scraper he designed, Cunha collected a (diachronic) corpus containing more
than two hundred thousand comments that appeared on UOL from 2016 to 2018.
The corpus comprises more than 7 million tokens and follows a
‘rich-get-richer’ pattern in both commenters and positive evaluations: a few
people make many comments while many participate only once or twice, and a
few comments collect a considerable number of likes while many receive little
or no feedback. Further analysis of the corpus could be carried out with
other corpus software, such as AntConc. However, this particular corpus,
unlike other considerably smaller corpora of the same type in English or
Portuguese, is not annotated.
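
To make the ‘rich-get-richer’ pattern more concrete, the following minimal
sketch measures how concentrated comment authorship is. It is not taken from
the dissertation: the function name top_share, the parameter top_fraction,
and the toy author list are all hypothetical, and the data are invented for
illustration rather than drawn from the UOL corpus.

```python
from collections import Counter
from typing import Iterable


def top_share(authors: Iterable[str], top_fraction: float = 0.1) -> float:
    """Share of all comments written by the most active `top_fraction` of authors.

    `authors` holds one entry per comment (the name of whoever wrote it).
    """
    # Comments per author, most active authors first.
    counts = sorted(Counter(authors).values(), reverse=True)
    if not counts:
        return 0.0
    # Always look at at least one author, even for tiny samples.
    top_n = max(1, round(len(counts) * top_fraction))
    return sum(counts[:top_n]) / sum(counts)


# Invented toy data (NOT from the UOL corpus): one very active commenter
# and four people who commented once each.
toy_authors = ["ana"] * 6 + ["bia", "caio", "duda", "edu"]
print(top_share(toy_authors, top_fraction=0.2))
```

A value close to 1 for a small top_fraction would indicate that a handful of
commenters produce most of the material; the same function could be applied
to likes per comment instead of comments per author.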

Chapter 3 explores methodological limitations in diachronic corpora and
presents an algorithm that aims to help identify the establishment and
obsolescence of linguistic forms. No agreement has yet been reached on the
criteria for recognizing these processes, especially obsolescence (Tichý
2018). This obstacle is partly explained by the gap between the point in time
at which a word first appears in a given language and the time when a
significant part of the population becomes familiar with it (Tulloch 1991).
For this reason, Cunha proposes five possible states of a linguistic form in
a time period: a) established, b) obsolete, c) permanent, d) short-lived, e)
random. These categories are further defined based on binary criteria. Under
this analysis, corpora should be divided into uniform time frames. If the
relative frequency of a target item in a time frame is above a predefined
threshold, that frame is assigned the value 1; if it is below, 0. The
assignment yields a binary string with a particular pattern. For example, if
the corpus spans one hundred years and is segmented into ten sub-time frames
of ten years each, sequences such as 0000111111, 0101010101, and 1111100000
could appear, among other logically possible options. These three patterns
correspond to the established, random, and obsolete states, respectively. As
an additional advantage, the algorithm tolerates less prototypical binary
sequences by determining which sub-time frame yields the fewest deviations
from the expected sequence (all 0, all 1). Finally, Cunha applies the
algorithm to the Corpus of Historical American English (Davies 2012) and
reports favorable results in characterizing established, obsolete, lost, and
short-lived words, among others.
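
The binary encoding described above can be sketched in a few lines. The code
below is only an illustration built from this summary, not Cunha's
implementation: the function names, the max_dev tolerance parameter, the
‘unattested’ label for all-zero strings, the treatment of values exactly at
the threshold as 0, and the precedence given to the short-lived pattern are
all assumptions made for the example.

```python
import re
from typing import Sequence, Tuple


def binarize(rel_freqs: Sequence[float], threshold: float) -> str:
    """Encode each time frame as '1' if its relative frequency is above the threshold."""
    # Assumption: a value exactly equal to the threshold counts as 0.
    return "".join("1" if f > threshold else "0" for f in rel_freqs)


def fewest_deviations(bits: str, before: str, after: str) -> Tuple[int, int]:
    """Compare `bits` with every ideal string before*k + after*(n-k).

    Returns (fewest mismatches, split point k that achieves them).
    """
    n = len(bits)
    candidates = []
    for k in range(1, n):  # keep both sides of the split non-empty
        ideal = before * k + after * (n - k)
        mismatches = sum(a != b for a, b in zip(bits, ideal))
        candidates.append((mismatches, k))
    return min(candidates)


def classify(bits: str, max_dev: int = 0) -> str:
    """Assign a binary string to one of the states, allowing up to `max_dev` deviations."""
    if len(bits) < 2:
        raise ValueError("need at least two time frames")
    if "0" not in bits:
        return "permanent"
    if "1" not in bits:
        return "unattested"  # hypothetical label, not one of the five states
    if re.fullmatch(r"0+1+0+", bits):
        return "short-lived"  # assumption: a single run of 1s between 0s
    est_dev, _ = fewest_deviations(bits, "0", "1")  # 0...01...1
    obs_dev, _ = fewest_deviations(bits, "1", "0")  # 1...10...0
    if est_dev <= obs_dev and est_dev <= max_dev:
        return "established"
    if obs_dev < est_dev and obs_dev <= max_dev:
        return "obsolete"
    return "random"


for pattern in ("0000111111", "0101010101", "1111100000"):
    print(pattern, classify(pattern))
```

With the default max_dev of 0, only prototypical strings are labeled
established or obsolete; calling classify("0010111111", max_dev=1) shows how
a single deviating frame can be tolerated.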

Chapter 4 presents a framework for corpus analysis based on the examination
of changes in the expression ‘fake news’ both in the English-speaking world
and in Brazil, concentrating on the periods around the 2016 US election and
the 2018 Brazilian presidential election. Cunha claims that the change in
societal interest in this subject ended up transforming the linguistic
expression itself, thus positing a link between certain terminology and
social changes. To support this point, the author uses two diachronic corpora
of news articles. For English, he selects the NOW Corpus (Davies 2013),
whereas for Brazilian Portuguese he creates an ad-hoc corpus consisting of
almost five thousand tokens of the term ‘fake news’ found in ten Brazilian
news sites. Through an analysis that combines multiple techniques (web search
behavior, co-occurring entities and general vocabulary, co-occurrence
networks, contextualized topics, and polarity, i.e., the sentiment around the
utterance), Cunha observes that interest in fake news increased globally
after the US election in 2016, when the term became highly specialized around
topics and contexts related to politics rather than the media industry, as
had been the case in the data before 2016. In Brazil in particular, during
and after the 2018 presidential election, the focus shifted from US politics
or ‘fake news’ in society in general to subjects that revolve around
Brazilian domestic affairs. Thus, the rise of public interest in the term
‘fake news’, from a niche expression to a widely known one, entailed changes
in its conceptualization.
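
One of the techniques listed above, counting the vocabulary that co-occurs
with the target term, can be illustrated with a short sketch. It assumes a
simple regex tokenizer and a fixed token window; the function name
cooccurring_words, the window parameter, and the example sentence are
hypothetical and are not taken from the NOW Corpus or from Cunha's ad-hoc
corpus.

```python
import re
from collections import Counter
from typing import Iterable


def cooccurring_words(texts: Iterable[str], term: str = "fake news",
                      window: int = 5) -> Counter:
    """Count words found within `window` tokens of each occurrence of `term`."""
    term_tokens = term.lower().split()
    n = len(term_tokens)
    counts: Counter = Counter()
    for text in texts:
        # Naive tokenization: lowercase runs of word characters.
        tokens = re.findall(r"\w+", text.lower())
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == term_tokens:
                left = tokens[max(0, i - window):i]
                right = tokens[i + n:i + n + window]
                counts.update(left + right)
    return counts


# Invented example sentence, for illustration only.
sample = ["The election and fake news spread"]
print(cooccurring_words(sample, window=2).most_common())
```

Running the function separately on texts from different periods and comparing
the resulting counts would give a rough view of how the contexts of the term
shift over time; stop-word filtering and lemmatization are left out here for
brevity.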

The last section of the book, Chapter 5, briefly sums up the conclusions of
the investigation, which center on the outcomes of the three main chapters
(2-4). Additionally, Cunha acknowledges a series of limitations of his
research, including the lack of annotation, the fact that the Xereta scraper
currently works with only two news portals, and a certain degree of imbalance
in the samples of the databases used.

EVALUATION

The main problems of the dissertation concern its organization, which the
author himself addresses to some degree. First, he acknowledges the lack of
precise integration among the core chapters of the book, which were
originally conceived as three separate papers. Although one could postulate a
certain sequence from Chapter 2 through Chapter 4, following the different
stages of corpus-linguistic methodology and ultimately leading to an example
of analysis, each chapter feels like a self-contained capsule in which many
different aspects are addressed at once, with little connection between
phenomena across the book. Moreover, some computational tools are presented
as easy to use; however, a fair few need to be complemented with additional
instruments, such as lemmatizers or specific corpus-oriented software,
especially in Chapter 4. Additionally, some of the outcomes, particularly in
Chapter 3, tend to focus more on phenomena related to spelling variation than
on grammar or lexicography.

Nonetheless, these possible issues could also be seen as advantages,
especially for readers who consult the book for concrete data or
methodological advice. Furthermore, ‘Contributions to the Computational
Processing of Diachronic Linguistic Corpora’ remains a solid work from a
conceptual point of view and represents a great asset for the integration of
more sophisticated and accurate computational tools into the domain of
linguistics, without failing to provide a general outlook on both
computational and diachronic corpus linguistics. More importantly, the book
successfully focuses on a language other than English (Brazilian Portuguese)
and aims to help remove barriers to scientific access by providing several
free and simple-to-use tools to work with. Also, specialized concepts are
always explained and contextualized, which makes the dissertation easy to
read even for people with very little expertise in the field. Finally, the
work marks out the path that research on computational linguistics and
diachronic corpus linguistics should follow. Several readers, and most
certainly Cunha himself, will pick up from here and find new and refined ways
to contribute to corpus linguistics through computational tools.

REFERENCES

Cunha, Evandro L. T. P., Gabriel Magno & Virgilio Almeida. 2017. A elaboração
de um coletor e de um corpus de comentários extraídos de portais de notícias.
In Anais do X Congresso Internacional da Associação Brasileira de Linguística
(ABRALIN), 764-771. Niterói: Universidade Federal Fluminense.

Davies, Mark. 2012. Expanding horizons in historical linguistics with the
400-million word Corpus of Historical American English. Corpora 7(2). 121-157.

Davies, Mark. 2013. Corpus of News on the Web (NOW): 3+ billion from 20
countries, updated every day. Retrieved from https://corpus.byu.edu/now. Last
access on May 28, 2021.

Tichý, Ondřej. 2018. Lexical obsolescence and loss in English: 1700-2000. In
Kopaczyk, Joanna & Jukka Tyrkkö (eds.), Applications of pattern-driven methods
in corpus linguistics, 81-103. Amsterdam: John Benjamins.

Tulloch, Sara. 1991. The Oxford dictionary of new words: A popular guide to
words in the news. Oxford: Oxford University Press.

ABOUT THE REVIEWER

Nicolás Arellano is a Linguistics graduate student (Universidad de Buenos
Aires) with a scholarship granted by Consejo Nacional de Investigaciones
Científicas y Técnicas de Argentina. His main topic of research is Spanish
lexicology within a usage-based approach. He is interested in corpus
linguistics and has authored several articles and presentations in this
field. He also has some experience in second language acquisition and second
language teaching.
