[Ads-l] copy and paste OCR'd text

Tue Sep 5 21:06:35 UTC 2017

A friend of mine is writing a book about a turn-of-the-century magician.  He asked me the following:


I could use some help converting a ton of news articles into text that I can import into Pages or Word. The quality from Newspapers.com is readable, but not good enough for OCR. Too many errors. Dictating or plain typing takes forever. I've brought some to a few typists, but these days most people turn down the job if it involves laborious hand-typing!

Thoughts? Is there a way to strip text off Newspapers.com, etc.?


I responded:


I don't have any good ideas here.  I've tried importing images into Adobe Acrobat and running OCR on them, but I've never had good results.  I think the problem is that the contrast is low.  I can take an image into GIMP or Photoshop and boost the contrast, which gives better results, but they're never good enough that I can simply cut and paste the recognized text.  I always end up retyping.
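
For anyone who would rather script the contrast boost than do it by hand in GIMP or Photoshop, here is a minimal sketch in Python. It is only an illustration of the idea, not what those programs do internally. It assumes the clip has already been loaded as a flat list of grayscale values from 0 to 255 (loading and saving the image would need an imaging library, which is left out here), and the function names and the percentile and threshold defaults are hypothetical choices of mine.

```python
def stretch_contrast(pixels, low_pct=2.0, high_pct=98.0):
    """Linearly stretch grayscale values (0-255).

    The value at the low_pct percentile is mapped to 0 and the value at
    the high_pct percentile is mapped to 255; anything outside that range
    is clipped. The percentile defaults are illustrative assumptions.
    """
    if not pixels:
        return []
    ordered = sorted(pixels)
    n = len(ordered)
    lo = ordered[int((n - 1) * low_pct / 100.0)]
    hi = ordered[int((n - 1) * high_pct / 100.0)]
    if hi <= lo:
        # Flat image: there is no contrast to stretch, so return a copy.
        return list(pixels)
    stretched = []
    for p in pixels:
        value = round((p - lo) * 255 / (hi - lo))
        stretched.append(max(0, min(255, value)))  # clip to 0-255
    return stretched


def binarize(pixels, threshold=128):
    """Turn grayscale values into pure black (0) or white (255)."""
    return [255 if p >= threshold else 0 for p in pixels]


if __name__ == "__main__":
    # A washed-out scan: every value sits in a narrow gray band.
    faded = [100, 110, 120, 130, 140, 150]
    boosted = stretch_contrast(faded, low_pct=0, high_pct=100)
    print(boosted)
    print(binarize(boosted))
```

The stretch spreads a narrow band of grays across the full range, and the optional threshold step forces each pixel to black or white. Whether that actually helps the OCR will depend on the scan, and in my experience a boost like this still leaves plenty of errors.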

A trick that works occasionally is to see if I can find the same article in Newspaperarchive.com, which currently allows you to cut and paste from their native PDF files.  Unfortunately, they indicate that they are going to drop PDFs at the end of the month and display only JPGs.  Sometimes the Fulton database has duplicate articles, and you can cut and paste from their PDFs.  (But they often have the worst microfilm sources, and the OCR'd text has tons of errors.)  Both of these depend on the article showing up in the database, and that only happens sometimes.


(FWIW, I don't see any significant difference in OCR quality between using Newspapers.com's native "save a clip to a jpg" feature and using Microsoft's Snipping Tool to grab areas off the display.)

Does anyone here have any better solutions?


The American Dialect Society - http://www.americandialect.org
