[Ads-l] Research Topic: Estimating date of composition from a text passage
adsgarsonotoole at GMAIL.COM
Wed Dec 29 19:23:30 UTC 2021
The discussion of anachronisms in scripts reminded me of a suggestion
I made last year to a friend who is an artificial intelligence researcher.
Too many books in the Google Books database have been assigned
incorrect dates. Also, too many newspaper pages in databases (such as
newspapers.com) have been assigned incorrect dates. It should be
possible to examine a text passage and assign an approximate date of
composition.
The frequencies of words, phrases, and grammatical constructs change
over time. Neural networks should be trainable to detect these types
of patterns. These patterns are complex and numerous. For example, the
two strings "Lunar Excursion Module" and "Neil Armstrong" are closely
associated with the moon landing in 1969. The two phrases together are
unlikely to occur in a passage written in the 1950s or before.
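
A minimal sketch of the simplest version of this idea, a lookup of marker
phrases rather than a neural network, might look like the following. The
table MARKER_TERMS, its year values, and the function name
earliest_plausible_year are all hypothetical; the years are illustrative
placeholders, not researched first-use dates.

```python
import re

# Hypothetical marker table: phrase -> earliest year the phrase is assumed
# to be plausible in print. The years are illustrative placeholders only.
MARKER_TERMS = {
    "lunar excursion module": 1962,
    "neil armstrong": 1960,
}

def earliest_plausible_year(passage, markers=MARKER_TERMS):
    """Return (lower_bound_year, matched_terms) for a passage.

    The lower bound is the latest "earliest year" among the matched
    phrases; a passage cannot predate its newest marker. Returns
    (None, {}) when no marker phrase is found.
    """
    # Lowercase and collapse whitespace so line wraps do not break phrases.
    text = re.sub(r"\s+", " ", passage.lower())
    hits = {term: year for term, year in markers.items() if term in text}
    if not hits:
        return None, hits
    return max(hits.values()), hits

if __name__ == "__main__":
    sample = "Neil Armstrong climbed down from the\nLunar Excursion Module."
    print(earliest_plausible_year(sample))
```

A trained model would presumably replace the hand-built table with
statistics learned from reliably dated text, but the output would play the
same role: an estimate to compare against the catalog date.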
A program that can assign approximate dates could be used to flag
books and newspaper pages that have been assigned dates that are
suspiciously inconsistent with the text. The items that have been
flagged can be examined manually. The metadata in these databases can
be cleaned up.
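
The flagging step could be sketched roughly as below. Everything here is an
assumption for illustration: the Record fields, the tolerance_years
threshold, and toy_estimator, which stands in for whatever date-estimation
model is actually used.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple

@dataclass
class Record:
    item_id: str        # hypothetical catalog identifier
    catalog_year: int   # date currently assigned in the database
    text: str           # text of the book or newspaper page

def flag_suspicious(
    records: Iterable[Record],
    estimate_year: Callable[[str], Optional[int]],
    tolerance_years: int = 10,
) -> List[Tuple[Record, int]]:
    """Return (record, estimated_year) pairs whose catalog date differs
    from the estimate by more than tolerance_years."""
    flagged = []
    for rec in records:
        estimate = estimate_year(rec.text)
        if estimate is None:
            continue  # no estimate available, so nothing to compare
        if abs(estimate - rec.catalog_year) > tolerance_years:
            flagged.append((rec, estimate))
    return flagged

def toy_estimator(text: str) -> Optional[int]:
    # Stand-in for a real model: recognizes a single phrase.
    return 1969 if "moon landing" in text.lower() else None

if __name__ == "__main__":
    records = [
        Record("a1", 1905, "Reports of the moon landing filled the page."),
        Record("a2", 1970, "A year after the moon landing, interest faded."),
    ]
    for rec, est in flag_suspicious(records, toy_estimator):
        print(rec.item_id, rec.catalog_year, est)
```

The flagged list is only a queue for manual review; the sketch does not
change any metadata itself.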
Of course, there are complications because books from earlier times
are often reprinted. Also, passages within books are reprinted from
earlier times. So a single book may contain passages from multiple
time periods. But these are complications, not show-stoppers.
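
One way to cope with mixed-period books, sketched here under the
assumption that passages are separated by blank lines, is to estimate each
passage separately and report the spread. The function names and the
estimate_year callback are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

def estimate_by_passage(
    book_text: str,
    estimate_year: Callable[[str], Optional[int]],
) -> List[Tuple[int, Optional[int]]]:
    """Split text on blank lines and estimate a year for each passage.

    Returns (passage_index, estimated_year) pairs; the year is None when
    the estimator has no opinion about a passage.
    """
    passages = [p.strip() for p in book_text.split("\n\n") if p.strip()]
    return [(i, estimate_year(p)) for i, p in enumerate(passages)]

def year_range(
    estimates: List[Tuple[int, Optional[int]]],
) -> Optional[Tuple[int, int]]:
    """Return (earliest, latest) estimated year, or None if no estimates."""
    years = [y for _, y in estimates if y is not None]
    return (min(years), max(years)) if years else None
```

A wide range suggests reprinted material, and the latest estimate is the
more relevant one when checking the date of the volume as a whole.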
My friend pointed to the following 2015 citation indicating that some
researchers have already explored this topic, but the new generation
of AI techniques is remarkably powerful. I wish some Google or OpenAI
researchers would work on this problem.
Title: Predicting Publication Date: a Text Analysis Exercise over
250,000 Volumes in the HTRC Secure HathiTrust Analytics Research Commons
The research question is this: can the body of a text be mined to
accurately predict book publication time where that information is
missing in the catalog record? The work is motivated by a non-trivial
number of catalog records in HathiTrust that are incomplete and in
some cases inaccurate. From the data set the authors used, a full 13%
of publish date values are missing.
The American Dialect Society - http://www.americandialect.org