[Ads-l] Research Topic: Estimating date of composition from a text passage

Shapiro, Fred fred.shapiro at YALE.EDU
Wed Dec 29 21:20:00 UTC 2021


In my experience, and I am sure in the experience of others, most of the misdatings in databases such as Google Books and Internet Archive arise because they date many, many serial publications by the year the serial began rather than by the date of the specific issue being cited.

Fred Shapiro


________________________________
From: American Dialect Society <ADS-L at LISTSERV.UGA.EDU> on behalf of ADSGarson O'Toole <adsgarsonotoole at GMAIL.COM>
Sent: Wednesday, December 29, 2021 2:23 PM
To: ADS-L at LISTSERV.UGA.EDU <ADS-L at LISTSERV.UGA.EDU>
Subject: Research Topic: Estimating date of composition from a text passage

The discussion of anachronisms in scripts reminded me of a suggestion
I made last year to a friend who is an artificial intelligence
researcher.

Too many books in the Google Books database have been assigned
incorrect dates. Also, too many newspaper pages in databases (such as
newspapers.com) have been assigned incorrect dates. It should be
possible to examine a text passage and assign an approximate date of
composition.

The frequencies of words, phrases, and grammatical constructs change
over time. Neural networks should be trainable to detect these types
of patterns. These patterns are complex and numerous. For example, the
two strings "Lunar Excursion Module" and "Neil Armstrong" are closely
associated with the moon landing in 1969. The two phrases together are
unlikely to occur in a passage written in the 1950s or before.
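
As a rough illustration of the marker-phrase idea above, here is a minimal Python sketch. It is a simple lookup, not the neural-network approach suggested in the text. The lexicon, the function name, and the years are all hypothetical: the years are placeholders chosen only to match the example (both phrases unlikely in the 1950s or before), not verified first-attestation dates.

```python
import re

# Hypothetical lexicon: phrase -> earliest year the phrase is assumed plausible.
# The years are illustrative placeholders, NOT verified attestation dates.
MARKER_PHRASES = {
    "lunar excursion module": 1960,
    "neil armstrong": 1960,
}


def earliest_plausible_year(text, markers=MARKER_PHRASES):
    """Return a lower bound on the composition year implied by marker phrases.

    A passage cannot predate the newest phrase it contains, so the bound is
    the maximum of the years attached to the phrases found. Returns None
    when no marker phrase occurs in the text.
    """
    # Lowercase and collapse whitespace so phrases split across lines match.
    normalized = re.sub(r"\s+", " ", text.lower())
    years = [year for phrase, year in markers.items() if phrase in normalized]
    return max(years) if years else None


passage = "Neil Armstrong climbed down from the Lunar Excursion\nModule."
print(earliest_plausible_year(passage))
```

A real system would need a far larger lexicon, or a trained model, and would have to allow for the same string carrying an earlier, unrelated meaning.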

A program that can assign approximate dates could be used to flag
books and newspaper pages that have been assigned dates that are
suspiciously inconsistent with the text. The items that have been
flagged can be examined manually. The metadata in these databases can
be cleaned up.
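
The flagging step might look something like the following sketch. It assumes that some date-estimation model has already produced an estimated year for each item. The record fields, the function name, and the 20-year tolerance are hypothetical choices for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Record:
    item_id: str
    catalog_year: int    # date assigned in the database metadata
    estimated_year: int  # output of an assumed text-based date estimator


def flag_suspicious(records, tolerance_years=20):
    """Return the records whose catalog date differs from the text-based
    estimate by more than tolerance_years, for manual examination."""
    return [
        r for r in records
        if abs(r.catalog_year - r.estimated_year) > tolerance_years
    ]


records = [
    Record("serial-issue-A", catalog_year=1885, estimated_year=1969),
    Record("book-B", catalog_year=1921, estimated_year=1925),
]
for r in flag_suspicious(records):
    print(r.item_id, r.catalog_year, r.estimated_year)
```

The tolerance trades off the number of items sent for manual review against the risk of missing a misdated item; a sensible value would depend on how accurate the estimator turns out to be.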

Of course, there are complications because books from earlier times
are often reprinted. Also, passages within books are reprinted from
earlier times. So a single book may contain passages from multiple
time periods. But these are complications, not show-stoppers.
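
One possible way to cope with mixed-period books is to estimate a date for each passage separately and then summarize the spread. The sketch below assumes per-passage estimates are already available from some estimator; the function name and the returned fields are hypothetical.

```python
def summarize_passage_years(passage_years):
    """Summarize per-passage year estimates for a single book.

    passage_years holds one estimated year per passage, or None where no
    estimate could be made. The latest estimate bounds the date of the
    printing in hand, since a volume cannot predate its newest passage;
    a wide spread hints at reprinted or quoted older material.
    """
    known = [y for y in passage_years if y is not None]
    if not known:
        return None
    return {
        "earliest": min(known),
        "latest": max(known),
        "spread": max(known) - min(known),
    }


print(summarize_passage_years([1850, None, 1852, 1921]))
```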

My friend pointed to the following 2015 citation indicating that some
researchers have already explored this topic, but the new generation
of AI techniques is remarkably powerful. I wish some Google or OpenAI
researchers would work on this problem.

Title: Predicting Publication Date: a Text Analysis Exercise over
250,000 Volumes in the HTRC Secure HathiTrust Analytics Research
Commons
https://www.hathitrust.org/files/PlaleChen-HTRC-Analytics.pdf

[Begin excerpt]
The research question is this: can the body of a text be mined to
accurately predict book publication time where that information is
missing in the catalog record? The work is motivated by a non-trivial
number of catalog records in HathiTrust that are incomplete and in
some cases inaccurate. From the data set the authors used, a full 13%
of publish date values are missing.
[End excerpt]

Garson

------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org/



