[Ads-l] Databases for Historical U.S. Searching

ADSGarson O'Toole adsgarsonotoole at GMAIL.COM
Sun Jul 3 19:49:23 UTC 2022


Congratulations on building such an impressive personal digital
archive, and thanks for sharing your knowledge, GB.

I have also built a personal digital archive, but it is not so
extraordinary. During the past decade I have acquired numerous books
for research such as “A New Dictionary of Quotations on Historical
Principles from Ancient and Modern Sources” compiled by H. L. Mencken
and “The Quotable Voltaire” compiled by Garry Apgar and Edward M.
Langille. To allow quick and efficient access to these books I scan
them and construct searchable PDFs.

Microsoft Windows now automatically builds a search index of a
folder's contents, so I can simultaneously search many books in my
digital archive for a quoted phrase. Of course, FoxTrot Professional
Search would probably build a superior index (on a Mac) and enable
more sophisticated queries.

Issues of copyright are complex and sensitive. I own the books
mentioned above, and the archive has been created for personal
research use, so I think this activity qualifies as fair use with
respect to copyright.

A searchable personal digital archive would be helpful to many
researchers, e.g., historians, biographers, and linguists. The archive
would include all the researcher's previous notes together with books,
clippings, and other documents. Are strategies for constructing and
maintaining this type of archive being taught to the current
generation?

Massive newspaper, book, and journal archives of today require
hundreds of terabytes of storage. A 2018 article stated that ProQuest
was planning to construct a long-term archival storage system holding
600 terabytes of data. In the future, petabytes and more will be
employed.
I’ve prepared a Dad joke: yottabytes is a lotta bytes.

https://librarytechnology.org/pr/23231

Interestingly, Fultonhistory.com is operated by one person, Tom
Tryniski, who purchased a microfilm scanner.

Garson

On Sat, Jul 2, 2022 at 4:40 PM Grant Barrett <gbarrett at worldnewyork.org> wrote:
>
> That would be a wonderful situation! I think pigs will get their pilot
> licenses first, though.
>
> I long ago stopped waiting for others to make a searchable solution for
> me. When I find something useful, I've been saving it locally for more than
> 20 years. I realize that doesn't help when you are searching something new,
> but it does help when you are revisiting old questions.
>
> Besides print reference works, I have 870GB+ of digital data indexed by a
> program called FoxTrot Professional Search. It allows complex searching
> using Boolean- and regex-style operators on a wide variety of formats
> including email, HTML, XML, PDF, and Word. Just being able to search for
> "WORD1 within 10 words of WORD2" is huge, as well as having proper
> wildcards, and *no* stop words.
>
> https://foxtrot-search.com/
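
The "WORD1 within 10 words of WORD2" proximity operator GB describes
can be approximated in plain Python. This is only a sketch of the
idea, not FoxTrot's actual implementation:

```python
import re

def within(text, word1, word2, n=10):
    """True if word1 and word2 occur within n words of each other, in either order."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    pos1 = [i for i, w in enumerate(words) if w == word1.lower()]
    pos2 = [i for i, w in enumerate(words) if w == word2.lower()]
    return any(abs(i - j) <= n for i in pos1 for j in pos2)
```

A production search engine would answer the same question from a
positional index rather than by re-tokenizing the document, but the
distance test is the same.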
>
> As for OCR, I've spent the last couple of years trying out all of the open
> source OCR packages I could find (and could make work) and many of the paid
> ones. I use them to handle screen grabs and public domain files I download
> from Archive.org and elsewhere and then index on my own computer (since the
> OCR done automatically by the Internet Archive is often poorly tuned).
>
> The best open source OCR package for my purposes is OCRmyPDF, which uses
> the Tesseract OCR engine but also handles things like improving
> page contrast, deskewing, despeckling, file size reduction, and other
> cleanup that Tesseract does not. Both OCRmyPDF and Tesseract are under
> active development.
>
> https://github.com/ocrmypdf/OCRmyPDF
> https://github.com/tesseract-ocr/tesseract
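
The cleanup steps mentioned above (deskewing, despeckling, rotation,
file-size reduction) correspond to flags in OCRmyPDF's documented
command line. A small helper to assemble such an invocation might look
like this; the flag names are real, but the helper itself and the file
names are just illustrations:

```python
def ocrmypdf_command(src, dst, language="eng", optimize=2):
    """Assemble an ocrmypdf command line with the cleanup steps described above."""
    return [
        "ocrmypdf",
        "--deskew",                   # straighten skewed pages
        "--clean",                    # despeckle pages with unpaper before OCR
        "--rotate-pages",             # correct upside-down or sideways pages
        "--optimize", str(optimize),  # shrink the output file (levels 0-3)
        "--language", language,       # Tesseract language code, e.g. "eng"
        src,
        dst,
    ]
```

With OCRmyPDF installed, the list can be passed to
subprocess.run(..., check=True) to batch-process downloaded scans.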
>
> I have also settled on the commercial software ABBYY FineReader
> https://pdf.abbyy.com/, which gives better OCR results than OCRmyPDF, but
> the latest Mac version lacks scripting features and doesn't lend
> itself to batch mode, so it requires more hand-holding. I use this for
> specific documents when I want to be certain about not missing things in
> them (for example, the English Dialect Dictionary, whose online interface
> at https://eddonline-proj.uibk.ac.at/edd/index.jsp# doesn't allow as
> sophisticated searches as I would like). I am contemplating trying the
> Windows version in virtualization to see if that is worth the trouble and
> cost.
>
> I also paid a programmer to fork OCRmyPDF to give it the option to use
> Google's OCR engine instead of Tesseract. That fork is here:
> https://github.com/ualiawan/OCRmyPDF. It's more fiddly than the regular
> OCRmyPDF, and it requires a Google Cloud Vision account (which charges some
> fraction of a cent for each page OCRed), but it works well, and in some
> cases may produce better results than OCRmyPDF, although you must be sure
> to specify the language of the document.
>
> GB
>
> On Fri, Jul 1, 2022 at 7:15 PM ADSGarson O'Toole <adsgarsonotoole at gmail.com>
> wrote:
>
> > Fred said:
> > >> If anyone knows of specific state projects that have a lot of material
> > >> not included in Chronicling America, I would love to hear about them.
> >
> > Thanks to everyone who has commented on this thread.
> >
> > Having to visit fifty separate state databases and deal with fifty
> > separate user interfaces would be a nightmare. Organizations of
> > historians, librarians, archivists, and linguists (including the
> > American Dialect Society) should be pushing to improve this
> > situation.
> >
> > Archivists employed by U.S. states should be coordinating with one
> > another to create a standard format for scans and metadata derived
> > from books, magazines, journals, newspapers and other documents. These
> > scans should be aggregated into a single searchable database with a
> > high-quality user interface.
> >
> > There should be ongoing research to create the best possible optical
> > character recognition (OCR) engine. The OCR engine should be
> > maintained as an open source piece of software available to all.
> > Periodically, the scans should be reprocessed by the latest and
> > best OCR engine, and a comprehensive index of all the text should
> > be rebuilt.
> >
> > Is the Internet Archive doing this? Is HathiTrust doing this? Is
> > Chronicling America (Library of Congress) doing this?
> >
> > Of course, it is easy to point out what should be done and to make
> > demands. But the current situation is aggravating because it is
> > extraordinarily wasteful.
> >
> > Garson
> >
> > ------------------------------------------------------------
> > The American Dialect Society - http://www.americandialect.org
> >
>


