[Ads-l] copy and paste OCR'd text

Ben Zimmer bgzimmer at GMAIL.COM
Tue Sep 5 17:25:57 EDT 2017


Dunno if this helps at all, but you can get the text of a page image as
Newspapers.com has OCR'ed it by using a "newspage" link. If the image link
is, say,

https://www.newspapers.com/image/25534415/

...then the link with OCR'ed text is:

https://www.newspapers.com/newspage/25534415/

That would obviously still require a lot of editing in most cases, but it
might be the fastest way to grab text.
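
If there are a lot of links to convert, the swap can be scripted. Here is a rough Python sketch. The helper name `newspage_url` is made up for illustration, and it assumes every image link follows the same pattern as the example above (a numeric ID after `/image/`). I have only the one example to go on, so treat the pattern as a guess.

```python
import re

# Assumption: image links look like https://www.newspapers.com/image/<digits>/
# (inferred from the single example above, not from any documented URL scheme).
IMAGE_URL = re.compile(r"^https?://www\.newspapers\.com/image/(\d+)/?")


def newspage_url(image_url: str) -> str:
    """Hypothetical helper: turn an image link into the matching "newspage" link."""
    match = IMAGE_URL.match(image_url.strip())
    if match is None:
        raise ValueError(f"not a recognized image link: {image_url!r}")
    # Keep the numeric page ID and swap "image" for "newspage".
    return f"https://www.newspapers.com/newspage/{match.group(1)}/"


if __name__ == "__main__":
    print(newspage_url("https://www.newspapers.com/image/25534415/"))
```

This only rewrites the link. The OCR'ed text on the resulting page would still need to be copied out and cleaned up by hand.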


On Tue, Sep 5, 2017 at 5:06 PM, MULLINS, WILLIAM D (Bill) CIV USARMY RDECOM
AMRDEC (US) <william.d.mullins18.civ at mail.mil> wrote:

> A friend of mine is writing a book about a turn-of-the-century magician.
> He asked me the following:
>
> *********************
>
> I could use some help converting a ton of news articles into text that I
> can import into Pages or Word. The quality from Newspapers.com is readable,
> but not enough for OCR. Too many errors. Dictating or plain typing takes
> forever. I've brought some to a few typists, but most people turn down the
> job if it involves laborious hand-typing these days!
>
> Thoughts? Is there a way to strip text off Newspapers.com, etc?
>
> *********************
>
> I responded:
>
> *********************
>
> I don't have any good ideas here.  I've tried importing images into Adobe
> Acrobat and doing OCR on them, but I've never had good results.  I think
> the problem is the contrast is low.  I can take an image into GIMP or
> Photoshop and boost the contrast, and get better results, but it's never
> good enough that I can simply cut and paste from the resulting image.  I
> always end up retyping.
>
> A trick that works occasionally is to see if I can find the same article
> in Newspaperarchive.com, which currently allows you to cut and paste from
> their native PDF files.  Unfortunately, they show that they are going to
> drop PDFs at the end of the month, and display only in JPGs.  Sometimes the
> Fulton database has duplicate articles, and you can cut and paste from
> their PDFs.  (but they often have the worst microfilm sources, and the
> OCR'd text has tons of errors.)  But both of these are dependent on the
> article showing up in the database, and that only happens sometimes.
>
> *********************
>
> (FWIW, I don't see any significant difference in OCR quality between using
> Newspapers.com's native "save a clip to a jpg" feature and using Microsoft's
> Snipping Tool to grab areas off the display.)
>
>
> Does anyone here have any better solutions?
>
> Thanks,
> Bill
>
>

------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org