35.936, Review: The Language of Fake News: Grieve & Woodfield (2023)
The LINGUIST List
linguist at listserv.linguistlist.org
Thu Mar 14 20:05:02 UTC 2024
LINGUIST List: Vol-35-936. Thu Mar 14 2024. ISSN: 1069-4875.
Subject: 35.936, Review: The Language of Fake News: Grieve & Woodfield (2023)
Moderators: Malgorzata E. Cavar, Francis Tyers (linguist at linguistlist.org)
Managing Editor: Justin Fuller
Team: Helen Aristar-Dry, Steven Franks, Everett Green, Daniel Swanson, Maria Lucero Guillen Puon, Zackary Leech, Lynzie Coburn, Natasha Singh, Erin Steitz
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org
Homepage: http://linguistlist.org
Please support the LL editors and operation with a donation at:
https://funddrive.linguistlist.org/donate/
Editor for this issue: Justin Fuller <justin at linguistlist.org>
LINGUIST List is hosted by Indiana University College of Arts and Sciences.
================================================================
Date: 15-Mar-2024
From: Elizabeth Craig [eccraig at uga.edu]
Subject: Forensic Linguistics: Grieve & Woodfield (2023)
Book announced at https://linguistlist.org/issues/34.3166
AUTHOR: Jack Grieve
AUTHOR: Helena Woodfield
TITLE: The Language of Fake News
SERIES TITLE: Elements in Forensic Linguistics
PUBLISHER: Cambridge University Press
YEAR: 2023
REVIEWER: Elizabeth Craig
SUMMARY
The Language of Fake News was authored by Jack Grieve (University of
Birmingham, UK), a quantitative corpus linguist more broadly
interested in dialectology, author identification, and language
change, and Helena Woodfield, a doctoral researcher specializing in
fake news. They present a concise outline of the linguistic
characteristics of fake news from the perspective of register
variation. Such a study falls under the purview of forensic
linguistics in that it investigates the distinctive grammatical
features of the news articles of one author, Jayson Blair, who
deliberately produced both true and false reports for The New York
Times (NYT) in the early 2000s. The celebrated journalist was forced
to resign after questions about the accuracy of his reporting arose
and suspicions of plagiarism were raised at the San Antonio
Express-News, whose reporting he had copied. The discovery of the
full extent of his deceit eventually brought down not only Blair but
two of his senior editors as well.
In the introductory chapter, the authors offer a clarifying
definition: fake news is not merely false information; it must also
be intentionally deceptive. They further contend that the
“distinctive communicative functions” of real as opposed to fake
news involve the use of differing linguistic structures. Because one
is meant to inform and the other to deceive, we should expect
disparate grammatical forms. The hope is that, given these
differences, a purely linguistic analysis could aid in detecting
intentional deception.
The term ‘fake news’ became a popular accusation during the 2016 US
presidential campaigns of Hillary Clinton and Donald Trump, but the
authors submit that fake news is as old as the news itself. The
advent of the internet, a 24/7 news cycle, and news-as-entertainment
(directed at selling ads) has made it far more pervasive. As a
result, distrust of both government and media has become widespread,
and everyone can choose where they get their news, be it true or
false. The authors argue that three points--the widespread
dissemination of fake news, its social impact, and its distinctive
linguistic characteristics--make such a study especially important
now.
The authors also differentiate the present study from earlier work
on fake news: prior studies in natural language processing (NLP)
tended to focus on language content, meaning, and topic, rather than
on linguistic structure. The present
news as genuine or fake because it relies on abstract, objective
categorizations (parts of speech as “principled sets of linguistic
features”), rather than on superficial word or phrase choices, as in
the earlier studies. The authors suggest that each methodology should
serve to substantiate rather than replace the other.
The second chapter presents a critical review of past research on
the language of fake news, with a focus on its shortcomings. The
authors begin by describing the limitations of veracity-based
studies utilizing NLP methods. In their view, machine learning
systems can determine only whether a news story is false, not
whether it is fake, because such methods analyze language content
without accounting for register variation or for disinformation,
i.e. intentional deceit. Grieve and Woodfield propose that their
framework, by focusing on linguistic structure and comparing two
parallel corpora from the same author, takes these factors into
account. They argue further that this methodology offers an
explanation, which is required to distinguish disinformation from
mere misinformation, i.e. fake news from false news. The authors
take issue with including disinformation as a subcategory of
misinformation, when it should be viewed as qualitatively distinct:
“(P)eople can inadvertently communicate falsehoods when they intend
to share accurate information, and this should not be confused with
lying…people can also state the truth when they intend to deceive if
they are misinformed themselves” (pp. 12-13).
The problem with NLP methods is that texts are judged along only one
dimension, whether they are true, when the author’s honesty is a
second dimension to be considered. Another issue with the
veracity-based approaches that dominate current research on fake
news is that whole articles are classified as either fake or true, a
binary distinction, even though they may contain both true and false
statements. Furthermore, such judgments are highly subjective. These
researchers instead focus on untrue news, produced by a single
author, that was intended to deceive. By analyzing the language
usage of only one author, they can be certain they are not dealing
with differences in register and/or dialect: the only difference
between the two corpora compared is that one is true and honest
(genuine) and the other is false and dishonest (fake). All other
differences are controlled for; this is what is meant by a corpus
being a principled collection of texts. A principle guiding this
research is that studies of register variation have demonstrated
clear and systematic grammatical differences depending on contexts
of use and function (Biber 1988).
Chapter Three recounts Jayson Blair’s stint at the NYT, his rapid
climb, and the downfall that resulted from a series of discrepancies
noted by several of his colleagues. Blair experienced a relatively
swift rise in his career as a newspaper journalist, which some
attributed to his race (African American), and he was soon promoted
to the National Desk in 2002, where he covered the DC Sniper case
and the Iraq War.
Chapter Four describes the building of the two parallel corpora used
to compare the language structures of the genuine versus the fake
news stories, with the genuine/fake classification based on the
newspaper’s own investigation. The authors include only those
articles Blair wrote during the six-month period scrutinized by his
employer and remove any articles that were co-authored. Only the
main text of the articles appears in the two corpora, with no
titles, captions, etc., since such material does not consist of
complete sentences and would make grammatical categorization by an
automatic tagger problematic. Short articles (under 300 words) were
also discarded, leaving only those between 321 and 1,825 words in
length. The total corpus comes to just under 57,000 words, with
roughly a 60/40 split between fake and genuine stories drawn from 36
and 28 articles, respectively. Only about half of Blair’s articles
on the DC Sniper case were fake, whereas all his articles on the
Iraq War, which came later, were fake. The book’s graphs show that
Blair became more prolific and more dishonest over time.
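To make these selection criteria concrete, here is a minimal Python
sketch of the kind of filtering described above; the record fields,
date window, and word threshold are illustrative stand-ins, not the
authors’ actual data or code.

    from datetime import date

    # Hypothetical article records; the field names are invented
    # here for illustration.
    articles = [
        {"byline": ["Jayson Blair"], "date": date(2003, 3, 27),
         "body": "word " * 400},
        {"byline": ["Jayson Blair", "A. Colleague"],
         "date": date(2003, 1, 5), "body": "word " * 500},
    ]

    # Illustrative six-month window reviewed by the newspaper.
    START, END = date(2002, 11, 1), date(2003, 4, 30)

    def keep(article):
        return (
            article["byline"] == ["Jayson Blair"]    # no co-authored pieces
            and START <= article["date"] <= END      # review period only
            and len(article["body"].split()) >= 300  # no short articles
        )

    # Main text only; titles and captions are assumed already stripped.
    corpus = [a["body"] for a in articles if keep(a)]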
The authors admit here to three shortcomings of this corpus: 1) it
is extremely small and therefore not conducive to rigorous
statistical analysis; 2) there is a large difference in the rate of
real vs. fake news depending on topic, which could point to register
variation as a factor; and 3) it represents the writings of only one
author, though this is by design and is meant to lend credence to
the findings.
Chapter Five quantifies the main grammatical features of each of the
two corpora and seeks to explain the differences between them. The
authors examine the relative frequencies of 49 grammatical features
(each measured per 100 words) and identify 28 that show
‘non-negligible’ differences. In general, “when Blair is telling the
truth, he tends to write more densely and with greater conviction”
(p. 38). An automated multidimensional analysis tagger, with a
claimed accuracy rate of 90%, is applied so that prior insights from
register analysis (Biber 1988) can be exploited, but the authors
submit that “any sufficiently accurate part-of-speech tagger would
allow for similar patterns to be broadly observed” (p. 38).
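To make the per-100-words measurement concrete, here is a minimal
Python sketch of one such feature count, using NLTK’s stock
part-of-speech tagger as a stand-in for the tagger applied in the
book (any sufficiently accurate tagger should do, as the authors
note); the toy texts are invented.

    # Assumes NLTK and its 'punkt' and 'averaged_perceptron_tagger'
    # models are installed (via nltk.download()).
    import nltk

    def noun_rate_per_100(text):
        tokens = nltk.word_tokenize(text)
        tagged = nltk.pos_tag(tokens)  # Penn Treebank tags
        # NN, NNS, NNP, NNPS: common and proper nouns
        nouns = sum(1 for _, tag in tagged if tag.startswith("NN"))
        return 100 * nouns / len(tokens)

    # Toy article texts standing in for the two corpora; one rate
    # per article yields the two samples compared in Chapter Five.
    genuine_corpus = ["Officials said yesterday that the suspect ..."]
    fake_corpus = ["He says he is certain that they are watching ..."]
    genuine_rates = [noun_rate_per_100(a) for a in genuine_corpus]
    fake_rates = [noun_rate_per_100(a) for a in fake_corpus]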
The degree of difference in the usage of each grammatical structure
across the two corpora is measured with Cliff’s delta, a
non-parametric effect size statistic suited to ordinal data. Blair’s
genuine news articles are found to have longer average word lengths
and more nouns and nominalizations, time adverbials, gerunds, and
participial adjectives; the fake news articles include more
emphatics, present tense verbs, perfect aspect verbs, adverbs,
copula be, predicative and attributive adjectives, subordinators,
and five types of pronouns.
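For readers unfamiliar with the statistic: Cliff’s delta compares
every value in one sample against every value in the other and
ranges from -1 to 1, with 0 indicating complete overlap. A minimal
sketch, computed over invented toy rates rather than the book’s
data:

    def cliffs_delta(xs, ys):
        # (#{x > y} - #{x < y}) / (|xs| * |ys|), over all pairs (x, y)
        greater = sum(1 for x in xs for y in ys if x > y)
        less = sum(1 for x in xs for y in ys if x < y)
        return (greater - less) / (len(xs) * len(ys))

    # Invented per-100-word noun rates, one per article; every
    # genuine value here exceeds every fake value, so delta is 1.0.
    genuine_rates = [31.2, 29.8, 33.5, 30.1]
    fake_rates = [26.4, 27.9, 25.2, 28.8]
    print(cliffs_delta(genuine_rates, fake_rates))  # -> 1.0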
This finding aligns with established distinctions between the
grammatical traits of informational as opposed to interactional
registers. “A dense style is the standard for newspaper writing
because it allows for detailed information to be conveyed in a
limited space” (Biber 1988). Nominal density is a basic
characteristic of informational prose. The authors offer two reasons
for the lack of information density in Blair’s false reporting: he
was under great pressure in his job to be productive, and he did not
have time to produce articles of appropriate conciseness.
Four verb features that the authors found to be “highly marked” in
Blair’s real articles, because they would not be expected there, are
suasive verbs, possibility modals, by-passives, and public verbs.
Wh-relatives, which involve nouns, were also more common in the fake
news. The authors attribute these anomalies to stance, the author’s
degree of conviction about the information being conveyed. For
example, looking at the public verb ‘to say,’ they find that Blair
uses it in the present tense almost exclusively in the articles in
which he is lying and only in the past tense in the true articles.
Chapter Six concludes with a summary of the results and the
implications of the study for identifying the attributes of fake
news. Grieve and Woodfield propose to explain why certain
grammatical patterns serve to distinguish fake news from real
reporting. They contend that the twenty-eight structural features
identified as quantitatively significant point to a stylistic
difference between the two text types in information density and
conviction. Further, they maintain that Blair’s fake news reports
were less nominally dense because of the pressure to publish a large
quantity in a short period of time; this relative paucity of nouns
in the intentionally falsified texts also marks them as more
uncertain. Past NLP models are again criticized here for focusing
only on the veracity aspect (content) of untrue news reports,
whereas these authors draw attention to the intent to deceive
(dishonesty), which they contend is revealed by subconscious
linguistic choices and is, after all, of greater social import. The
authors hope this study contributes to the future development of
large-scale fake news detection.
EVALUATION
In this Element, the authors introduce and apply a framework for the
linguistic analysis of fake news. They define fake news as false
information that is meant to deceive, and they argue that there are
systematic differences between real and fake news that reflect this
basic difference in communicative purpose. The authors consider one
famous case of fake news involving Jayson Blair of The New York Times,
which provides them with the opportunity to conduct, within this
framework, a controlled study of the effect of deception on the
language usage of a single reporter. Through a detailed grammatical
analysis of a
corpus of Blair's real and fake articles, they demonstrate that there
are clear differences in his writing style, with his real news
exhibiting greater information density and conviction than his fake
news. While information density can be determined by a preponderance
of nouns and their cohorts (adjectives, prepositions, etc.), I feel
conviction is a more subjective measure that can only be determined
with some consideration given to word choice. I find it difficult to
make the leap from information density to conviction in the absence of
a semantic analysis, which the authors provide in their discussion of
stance in Chapter Five.
One weakness admitted by these authors is that this corpus is small.
Indeed, by today’s standards even a corpus of half a million words
is considered small, given the power of machine processing. But this
size limitation was an outcome of adhering to the principle of using
a single author, which leads us to another issue, discussed below.
I also find here the same issue these authors raise in Chapter Two as
a problem with NLP methods: not every sentence in the fake news
articles is false. In other words, we are still applying a binary
distinction to whole articles as true or fake when certainly there are
true statements within each article, which could affect the numbers.
Another lingering question is that of authorship. The reason Blair was
eventually outed was his rampant plagiarism, which leads us to wonder
just how much of his later writing was his own. Is the corpus under
examination here really a single-author text, a principle that was put
forth as a basic criterion for the corpus construction? It would seem
there would need to be a comparison of Blair’s fake news reports to
the stolen reports. Were they merely copied and presented as his own,
or did he make any attempts to disguise his submissions for
publication? These researchers say that they eliminated from this
study any articles that were co-authored. But to what extent were
Blair’s fake news articles plagiarized, and were those articles
included in the study? Certainly, if he plagiarized whole articles,
they may have been factually true, but they were still deceptive.
The researchers do not discuss how they handled the plagiarism,
other than noting that it was Blair’s undoing. The plagiarized
articles may not represent Blair’s writing style at all. Some
comparison of the known-to-be plagiarized articles to what we know to
be Blair’s authentic writings would have served well here.
This entire scenario recalls the career of another fallen-star
reporter, Stephen Glass, former associate editor for The New Republic
until he was discovered to have concocted stories from whole cloth in
1998. Glass was known as a meticulous fact-checker who had “provided
copious notes and letters, business cards, e-mail addresses–much of
which is now believed to have been fabricated” (St. John 1998).
Therefore, one would expect his submissions to include names,
companies, places, and dates, i.e. proper nouns, which would
contribute to nominal density in his fabricated stories. In the
cinematic portrayal of his career trajectory, ‘Shattered Glass’
(2003), the prodigious writer eventually admits to falsifying 27 of 41
stories. It would be interesting to determine if the linguistic
patterns distinguishing fact from fiction in Blair’s journalism remain
valid for Glass’ prose as well, since they both relate to the same
register of language usage, newspaper reporting. I am surprised that
Grieve and Woodfield make no mention of this strikingly similar
case, which occurred just a few years prior and about which a rather
famous movie was made, especially since Stephen Glass is mentioned
in the title of one of their own references (Spurlock 2016).
This work constitutes a very interesting, timely, and relevant
contribution to the field of deception detection in news reporting
through forensic linguistics. The ability to determine fakeness by a
preponderance of certain grammatical patterns would be a useful tool
indeed for discerning deception in journalistic writing in general.
That statements are contrary to fact can be challenging to prove,
but it can be done; what is more difficult is verifying with
certainty an intention to deceive through a quantitative analysis of
grammatical structure (no matter how much we may want it to work). I
find it hard to make
this leap quite yet. It is imperative that we move swiftly forward
with this kind of research in a world of machine-generated information
and rapidly growing artificial intelligence capabilities. My fear is
that both Blair and Glass might still be flourishing in news reporting
today if they had had access to such a facilitator as ChatGPT!
REFERENCES
Biber, D. 1988. Variation across speech and writing. Cambridge:
Cambridge University Press.
Ray, Billy. 2003. Shattered Glass. Lionsgate Films. Retrieved January
12, 2024, from https://www.youtube.com/watch?v=LdtWcXAQ2Q0
St. John, Warren. 1998. How journalism’s new golden boy got thrown
out of New Republic. Observer. Retrieved December 04, 2023, from
https://observer.com/1998/05/how-journalisms-new-golden-boy-got-thrown-out-of-new-republic/
Spurlock, J. 2016. Why journalists lie: The troublesome times for
Janet Cooke, Stephen Glass, Jayson Blair, & Brian Williams. ETC: A
Review of General Semantics, 73(1), 71–76.
ABOUT THE REVIEWER
Dr. Elizabeth Craig is a freelance editor and ESL Instructor. She
holds a master’s degree in TESOL and a doctorate in linguistics.
eccraig at uga.edu
------------------------------------------------------------------------------
Please consider donating to the Linguist List https://give.myiu.org/iu-bloomington/I320011968.html
LINGUIST List is supported by the following publishers:
Cambridge University Press http://www.cambridge.org/linguistics
De Gruyter Mouton https://cloud.newsletter.degruyter.com/mouton
Equinox Publishing Ltd http://www.equinoxpub.com/
John Benjamins http://www.benjamins.com/
Lincom GmbH https://lincom-shop.eu/
Multilingual Matters http://www.multilingual-matters.com/
Narr Francke Attempto Verlag GmbH + Co. KG http://www.narr.de/
Wiley http://www.wiley.com
----------------------------------------------------------
LINGUIST List: Vol-35-936
----------------------------------------------------------