[Ads-l] Research Topic: Estimating date of composition from a text passage

Wed Dec 29 20:12:50 UTC 2021

Also, even if all the words are “period,” the actual dialogue, events depicted, clothing worn, time of day, lighting, scenery, props, and sequence of events are all fictional anyway, so it’s not a disaster if they use a word we understand that they wouldn’t have been familiar with.

From: ADSGarson O'Toole<mailto:adsgarsonotoole at GMAIL.COM>
Sent: Wednesday, December 29, 2021 11:23 AM
To: ADS-L at LISTSERV.UGA.EDU<mailto:ADS-L at LISTSERV.UGA.EDU>
Subject: Research Topic: Estimating date of composition from a text passage

The discussion of anachronisms in scripts reminded me of a suggestion
I made last year to a friend who is an artificial intelligence
researcher.

Too many books in the Google Books database have been assigned
incorrect dates. Also, too many newspaper pages in databases (such as
newspapers.com) have been assigned incorrect dates. It should be
possible to examine a text passage and assign an approximate date of
composition.

The frequencies of words, phrases, and grammatical constructs change
over time. Neural networks should be trainable to detect these types
of patterns. These patterns are complex and numerous. For example, the
two strings "Lunar Excursion Module" and "Neil Armstrong" are closely
associated with the moon landing in 1969. The two phrases together are
unlikely to occur in a passage written in the 1950s or before.
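
As a rough illustration of the idea, here is a minimal sketch that guesses
a decade from word frequencies alone. It swaps the neural network for a
much simpler add-one-smoothed word-count model (essentially naive Bayes
with equal weight for every decade), and the tiny corpus, the decade
labels, and the function names are all invented for this example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase the text and keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def train(labeled_passages):
    # labeled_passages: iterable of (decade, text) pairs.
    counts = defaultdict(Counter)
    for decade, text in labeled_passages:
        counts[decade].update(tokenize(text))
    vocab = set()
    for word_counts in counts.values():
        vocab.update(word_counts)
    return counts, vocab


def estimate_decade(text, counts, vocab):
    # Pick the decade whose word counts give the passage the highest
    # add-one-smoothed log-likelihood. The extra +1 in the denominator
    # reserves probability mass for words never seen in training.
    tokens = tokenize(text)
    best_decade, best_score = None, -math.inf
    for decade, word_counts in counts.items():
        total = sum(word_counts.values())
        denom = total + len(vocab) + 1
        score = sum(math.log((word_counts[t] + 1) / denom) for t in tokens)
        if score > best_score:
            best_decade, best_score = decade, score
    return best_decade


# Invented toy corpus; real training data would be dated text at scale.
TOY_CORPUS = [
    (1950, "the rocket program remains a distant dream for space travel"),
    (1950, "television sets arrive in more homes this year"),
    (1960, "neil armstrong stepped out of the lunar excursion module onto the moon"),
    (1960, "the apollo astronauts trained for the moon landing"),
]

counts, vocab = train(TOY_CORPUS)
guess = estimate_decade("Neil Armstrong and the Lunar Excursion Module", counts, vocab)
print(guess)
```

A real system would need far more training text, phrase and grammar
features, and some way to express uncertainty, but the principle of
scoring a passage against period-specific frequencies is the same.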

A program that can assign approximate dates could be used to flag
books and newspaper pages that have been assigned dates that are
suspiciously inconsistent with the text. The items that have been
flagged can be examined manually. The metadata in these databases can
be cleaned up.
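
The flagging step might look something like the sketch below. The record
layout, the 25-year tolerance, and the stand-in estimator are assumptions
made up for illustration; they are not taken from any actual database or
tool.

```python
def flag_suspicious(records, estimate_year, tolerance_years=25):
    # Return (id, catalog_year, estimated_year) for every record whose
    # catalog date differs from the text-based estimate by more than
    # tolerance_years. Flagged items are meant for manual review.
    flagged = []
    for record in records:
        estimated = estimate_year(record["text"])
        if abs(estimated - record["catalog_year"]) > tolerance_years:
            flagged.append((record["id"], record["catalog_year"], estimated))
    return flagged


def toy_estimate_year(text):
    # Hypothetical stand-in for a trained dating model.
    return 1969 if "neil armstrong" in text.lower() else 1900


# Invented records; "catalog_year" plays the role of the database date.
records = [
    {"id": "book-1", "catalog_year": 1905, "text": "Neil Armstrong walked on the moon."},
    {"id": "book-2", "catalog_year": 1910, "text": "The carriage arrived at dusk."},
]

flagged = flag_suspicious(records, toy_estimate_year)
print(flagged)
```

Only the first record is flagged here, because its catalog date is
decades away from what the text suggests. The tolerance would need
tuning so that reviewers are not swamped with false alarms.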

Of course, there are complications because books from earlier times
are often reprinted. Also, passages within books are reprinted from
earlier times. So a single book may contain passages from multiple
time periods. But these are complications, not show-stoppers.

My friend pointed to the following 2015 citation indicating that some
researchers have already explored this topic, but the new generation
of AI techniques is remarkably powerful. I wish some Google or OpenAI
researchers would work on this problem.

Title: Predicting Publication Date: a Text Analysis Exercise over
250,000 Volumes in the HTRC Secure HathiTrust Analytics Research
Commons
https://www.hathitrust.org/files/PlaleChen-HTRC-Analytics.pdf

[Begin excerpt]
The research question is this: can the body of a text be mined to
accurately predict book publication time where that information is
missing in the catalog record? The work is motivated by a non-trivial
number of catalog records in HathiTrust that are incomplete and in
some cases inaccurate. From the data set the authors used, a full 13%
of publish date values are missing.
[End excerpt]

Garson

------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org
