[Ads-l] Research Topic: Estimating date of composition from a text passage

Wed Dec 29 21:07:58 UTC 2021

The problems are that actually looking at the page or using humans to QA their data is expensive, and that the errors caused by bad OCR scans do not result in significant lost revenue. Therefore, it's not worth it to them to do so.

At least the AI solution appeals to the Silicon Valley mindset that engineers can solve any problem and that using computers is preferable to using people. Using AI has the same market drawbacks that using people does, but they just might implement it because it's a "cool" problem to solve.

-----Original Message-----
From: "Peter Reitan" <pjreitan at HOTMAIL.COM>
Sent: Wednesday, December 29, 2021 3:10pm
To: ADS-L at LISTSERV.UGA.EDU
Subject: Re: [ADS-L] Research Topic: Estimating date of composition from a text passage

As far as Newspapers.com mis-datings, it shouldn’t take artificial intelligence to solve most of their problems – looking at the date on the newspaper, instead of using a bad OCR scan that probably led to the problem years earlier, would help.

Large swathes of their collection are misdated – with most of the ones I’ve uncovered coming in just a few different newspapers, for which five or six years of the collection are all misdated to the same time frame.

I contacted them years ago on a “contact us” feature, designed to point out problems with specific scans – but nothing ever changed.

If the people who maintained the database actually used their database, they would solve most of the problems in about a week.


From: ADSGarson O'Toole <adsgarsonotoole at GMAIL.COM>
Sent: Wednesday, December 29, 2021 11:23 AM
To: ADS-L at LISTSERV.UGA.EDU
Subject: Research Topic: Estimating date of composition from a text passage

---------------------- Information from the mail header -----------------------
Sender: American Dialect Society <ADS-L at LISTSERV.UGA.EDU>
Poster: ADSGarson O'Toole <adsgarsonotoole at GMAIL.COM>
Subject: Research Topic: Estimating date of composition from a text
 passage
-------------------------------------------------------------------------------

The discussion of anachronisms in scripts reminded me of a suggestion
I made last year to a friend who is an artificial intelligence
researcher.

Too many books in the Google Books database have been assigned
incorrect dates. Also, too many newspaper pages in databases (such as
newspapers.com) have been assigned incorrect dates. It should be
possible to examine a text passage and assign an approximate date of
composition.

The frequencies of words, phrases, and grammatical constructs change
over time. Neural networks should be trainable to detect these types
of patterns. These patterns are complex and numerous. For example, the
two strings "Lunar Excursion Module" and "Neil Armstrong" are closely
associated with the moon landing in 1969. The two phrases together are
unlikely to occur in a passage written in the 1950s or before.
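
A minimal sketch of the co-occurrence idea above, using a hand-written
table in place of a trained neural network. The function name, the
table, and the 1960 lower bound (taken from "the 1950s or before") are
illustrative assumptions, not researched first-use dates.

```python
# Hypothetical marker table: (phrases that must all appear, earliest
# plausible year of composition). Illustrative values only.
MARKER_GROUPS = [
    (("lunar excursion module", "neil armstrong"), 1960),
]


def earliest_plausible_year(text, marker_groups=MARKER_GROUPS):
    """Return a lower bound on the year of composition, or None.

    A group contributes its year only if every phrase in it occurs
    in the text. The latest such year is the tightest lower bound.
    """
    # Lowercase and collapse whitespace so phrases split across
    # line breaks still match.
    normalized = " ".join(text.lower().split())
    bounds = [
        year
        for phrases, year in marker_groups
        if all(phrase in normalized for phrase in phrases)
    ]
    return max(bounds) if bounds else None
```

A real system would learn many thousands of such signals, with
weights, rather than rely on a few hand-picked phrases.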

A program that can assign approximate dates could be used to flag
books and newspaper pages that have been assigned dates that are
suspiciously inconsistent with the text. The items that have been
flagged can be examined manually. The metadata in these databases can
be cleaned up.
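
The flagging step might look like the sketch below. The record fields
("id", "catalog_year", "estimated_year") and the 20-year tolerance are
assumptions made for illustration; they do not describe any real
database schema.

```python
def flag_for_review(records, tolerance_years=20):
    """Return ids of records whose catalog date looks inconsistent.

    Each record is a dict with hypothetical keys "id", "catalog_year",
    and "estimated_year" (None when no estimate is available).
    """
    flagged = []
    for record in records:
        estimate = record["estimated_year"]
        if estimate is None:
            # No estimate from the text, so nothing to compare against.
            continue
        if abs(record["catalog_year"] - estimate) > tolerance_years:
            flagged.append(record["id"])
    return flagged
```

The flagged ids would then go to a person for manual checking; the
program only narrows down where to look.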

Of course, there are complications because books from earlier times
are often reprinted. Also, passages within books are reprinted from
earlier times. So a single book may contain passages from multiple
time periods. But these are complications, not show-stoppers.

My friend pointed to the following 2015 citation indicating that some
researchers have already explored this topic, but the new generation
of AI techniques is remarkably powerful. I wish some Google or OpenAI
researchers would work on this problem.

Title: Predicting Publication Date: a Text Analysis Exercise over
250,000 Volumes in the HTRC Secure HathiTrust Analytics Research
Commons
https://www.hathitrust.org/files/PlaleChen-HTRC-Analytics.pdf

[Begin excerpt]
The research question is this: can the body of a text be mined to
accurately predict book publication time where that information is
missing in the catalog record? The work is motivated by a non-trivial
number of catalog records in HathiTrust that are incomplete and in
some cases inaccurate. From the data set the authors used, a full 13%
of publish date values are missing.
[End excerpt]

Garson

------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org
