[Ads-l] Databases for Historical U.S. Searching
Grant Barrett
gbarrett at WORLDNEWYORK.ORG
Sat Jul 2 20:40:15 UTC 2022
That would be a wonderful situation! I think pigs will get their pilot
licenses first, though.
I long since stopped waiting for others to make a searchable solution for
me. For more than 20 years, whenever I find something useful, I have saved
it locally. I realize that doesn't help when you are searching for
something new, but it does help when you are revisiting old questions.
Besides print reference works, I have 870GB+ of digital data indexed by a
program called FoxTrot Professional Search. It allows complex searching
using Boolean- and regex-style operators on a wide variety of formats
including email, HTML, XML, PDF, and Word. Just being able to search for
"WORD1 within 10 words of WORD2" is huge, as well as having proper
wildcards, and *no* stop words.
https://foxtrot-search.com/
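For anyone without FoxTrot, the "WORD1 within 10 words of WORD2" idea can
be roughly sketched in Python. This is only an illustration of the concept,
not how FoxTrot works internally. The names tokenize and within_n_words and
the tokenizing regex are assumptions made for the example, and the
wildcards follow Python's fnmatch rules (* and ?), which may differ from
FoxTrot's.

```python
import re
from fnmatch import fnmatchcase


def tokenize(text):
    # Keep every word (no stop-word list); lowercase for case-insensitive matching.
    return re.findall(r"[\w'-]+", text.lower())


def within_n_words(text, pattern1, pattern2, n=10):
    """Return True if a word matching pattern1 occurs within n words of
    a different word position matching pattern2."""
    tokens = tokenize(text)
    pos1 = [i for i, t in enumerate(tokens) if fnmatchcase(t, pattern1.lower())]
    pos2 = [i for i, t in enumerate(tokens) if fnmatchcase(t, pattern2.lower())]
    # Compare every pair of positions; fine for a sketch, slow for huge texts.
    return any(i != j and abs(i - j) <= n for i in pos1 for j in pos2)
```

Called as within_n_words(text, "pig*", "licen?es", 10), it reports whether
any word starting with "pig" falls within ten words of "licenses" or
"licences". A real indexer would precompute word positions rather than
rescanning the text on every query.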
As for OCR, I've spent the last couple of years trying out all of the open
source OCR packages I could find (and could make work) and many of the paid
ones. I use them to handle screen grabs and public domain files I download
from Archive.org and elsewhere and then index on my own computer (since the
OCR done automatically by the Internet Archive is often poorly tuned).
The best open source OCR package for my purposes is OCRmyPDF, which uses
the Tesseract OCR engine but also handles things like improving
page contrast, deskewing, despeckling, file size reduction, and other
cleanup that Tesseract does not. Both OCRmyPDF and Tesseract are under
active development.
https://github.com/ocrmypdf/OCRmyPDF
https://github.com/tesseract-ocr/tesseract
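A batch run over a folder of downloaded PDFs might look something like the
sketch below. It assumes the ocrmypdf command-line tool is installed and on
the PATH, and that unpaper is installed for --clean. The function names,
the folder layout, and the particular choice of flags are hypothetical, not
a description of my actual setup.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocrmypdf_command(src, dst, language="eng"):
    # --deskew straightens pages, --clean runs unpaper before OCR,
    # --optimize 1 applies lossless file size reduction.
    return [
        "ocrmypdf",
        "--language", language,
        "--deskew",
        "--clean",
        "--optimize", "1",
        str(src),
        str(dst),
    ]


def ocr_folder(in_dir, out_dir, language="eng"):
    """OCR every PDF in in_dir into out_dir; return the output paths written."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(in_dir).glob("*.pdf")):
        dst = out_dir / src.name
        result = subprocess.run(build_ocrmypdf_command(src, dst, language))
        if result.returncode == 0:  # skip files that ocrmypdf rejected
            written.append(dst)
    return written
```

By default ocrmypdf refuses files that already contain a text layer, so
for Archive.org downloads with poor existing OCR you would probably want
to add --force-ocr (or --skip-text to leave those pages alone).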
I have also settled on the commercial software ABBYY FineReader
(https://pdf.abbyy.com/), which gives better OCR results than OCRmyPDF, but
the latest Mac version is lacking scripting features and doesn't lend
itself to batch mode, so it requires more hand-holding. I use this for
specific documents when I want to be certain about not missing things in
them (for example, the English Dialect Dictionary, whose online interface
at https://eddonline-proj.uibk.ac.at/edd/index.jsp# doesn't allow as
sophisticated searches as I would like). I am contemplating trying the
Windows version in virtualization to see if that is worth the trouble and
cost.
I also paid a programmer to fork OCRmyPDF to give it the option to use
Google's OCR engine instead of Tesseract. That fork is here:
https://github.com/ualiawan/OCRmyPDF. It's more fiddly than the regular
OCRmyPDF, and it requires a Google Cloud Vision account (which charges some
fraction of a cent for each page OCRed), but it works well, and in some
cases may produce better results than the regular OCRmyPDF, although you must be sure
to specify the language of the document.
GB
On Fri, Jul 1, 2022 at 7:15 PM ADSGarson O'Toole <adsgarsonotoole at gmail.com>
wrote:
> Fred said:
> >> If anyone knows of specific state projects that have a lot of material
> >> not included in Chronicling America, I would love to hear about them.
>
> Thanks to everyone who has commented on this thread.
>
> Having to visit fifty separate state databases and deal with fifty
> separate user interfaces would be a nightmare. Organizations of
> historians, librarians, archivists, and linguists (including the
> American Dialect Society) should be pushing to improve this
> situation.
>
> Archivists employed by U.S. states should be coordinating with one
> another to create a standard format for scans and metadata derived
> from books, magazines, journals, newspapers and other documents. These
> scans should be aggregated into a single searchable database with a
> high-quality user interface.
>
> There should be ongoing research to create the best possible optical
> character recognition (OCR) engine. The OCR engine should be
> maintained as an open source piece of software available to all.
> Periodically, the scans should be processed by the latest-best OCR
> engine and a comprehensive index of all the text should be
> constructed.
>
> Is the Internet Archive doing this? Is HathiTrust doing this? Is
> Chronicling America (Library of Congress) doing this?
>
> Of course, it is easy to point out what should be done and to make
> demands. But the current situation is aggravating because it is
> extraordinarily wasteful.
>
> Garson
>
> ------------------------------------------------------------
> The American Dialect Society - http://www.americandialect.org
>