[Ads-l] copy and paste OCR'd text

ADSGarson O'Toole adsgarsonotoole at GMAIL.COM
Tue Sep 5 23:10:16 UTC 2017


Thanks for this tip, Ben.
Garson

On Tue, Sep 5, 2017 at 5:25 PM, Ben Zimmer <bgzimmer at gmail.com> wrote:
> Dunno if this helps at all, but you can get the text of a page image as
> Newspapers.com has OCR'ed it by using a "newspage" link. If the image link
> is, say,
>
> https://www.newspapers.com/image/25534415/
>
> ...then the link with OCR'ed text is:
>
> https://www.newspapers.com/newspage/25534415/
>
> That would obviously still require a lot of editing in most cases, but it
> might be the fastest way to grab text.
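
For anyone with a long list of image links, the substitution described above can be sketched in Python. This is only an illustration of the link pattern given in this message: the helper name and the regular expression are my own assumptions, and nothing here fetches a page or relies on any published Newspapers.com API.

```python
import re

# Hypothetical helper (not from the thread): it only rewrites the link text,
# swapping the "/image/" path segment for "/newspage/" and keeping the page id.
_IMAGE_LINK = re.compile(r"^(https?://www\.newspapers\.com)/image/(\d+)/?$")


def image_to_newspage(url: str) -> str:
    """Return the "newspage" link for a Newspapers.com image link."""
    match = _IMAGE_LINK.match(url.strip())
    if match is None:
        raise ValueError(f"not a Newspapers.com image link: {url!r}")
    base, page_id = match.groups()
    return f"{base}/newspage/{page_id}/"


print(image_to_newspage("https://www.newspapers.com/image/25534415/"))
# prints https://www.newspapers.com/newspage/25534415/
```

Links that do not match the pattern raise a ValueError rather than being guessed at, so a malformed entry in a batch is noticed instead of silently producing a dead link.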
>
>
> On Tue, Sep 5, 2017 at 5:06 PM, MULLINS, WILLIAM D (Bill) CIV USARMY RDECOM
> AMRDEC (US) <william.d.mullins18.civ at mail.mil> wrote:
>
>> A friend of mine is writing a book about a turn-of-the-century magician.
>> He asked me the following:
>>
>> *********************
>>
>> I could use some help converting a ton of news articles into text that I
>> can import into Pages or Word. The quality from Newspapers.com is readable,
>> but not enough for OCR. Too many errors. Dictating or plain typing take
>> forever. I've brought some to a few typists, but most people turn down the
>> job if it involves laborious hand-typing these days!
>>
>> Thoughts? Is there a way to strip text off Newspapers.com, etc?
>>
>> *********************
>>
>> I responded:
>>
>> *********************
>>
>> I don't have any good ideas here.  I've tried importing images into Adobe
>> Acrobat and doing OCR on them, but I've never had good results.  I think
>> the problem is the contrast is low.  I can take an image into GIMP or
>> Photoshop and boost the contrast, and get better results, but it's never
>> good enough that I can simply cut and paste from the resulting image.  I
>> always end up retyping.
>>
>> A trick that works occasionally is to see if I can find the same article
>> in Newspaperarchive.com, which currently allows you to cut and paste from
>> their native PDF files.  Unfortunately, they show that they are going to
>> drop PDFs at the end of the month, and display only in JPGs.  Sometimes the
>> Fulton database has duplicate articles, and you can cut and paste from
>> their PDFs.  (But they often have the worst microfilm sources, and the
>> OCR'd text has tons of errors.)  But both of these are dependent on the
>> article showing up in the database, and that only happens sometimes.
>>
>> *********************
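
The contrast boost mentioned above (done by hand in GIMP or Photoshop) amounts to something like a linear contrast stretch. The sketch below is an assumption about what such a step does, not a description of either program: it works on a plain list of 8-bit grayscale values, and the function name and percentile parameters are hypothetical. A real workflow would read the pixels with an imaging library before handing the result to an OCR program.

```python
def stretch_contrast(pixels, low_pct=0.02, high_pct=0.98):
    """Linearly rescale 8-bit grayscale values so that the chosen low and
    high percentiles map to 0 and 255; values outside that range are clipped."""
    if not pixels:
        return []
    ordered = sorted(pixels)
    last = len(ordered) - 1
    lo = ordered[int(low_pct * last)]
    hi = ordered[int(high_pct * last)]
    if hi == lo:
        # Flat image: nothing to stretch, return a copy unchanged.
        return list(pixels)
    scale = 255.0 / (hi - lo)
    return [max(0, min(255, round((p - lo) * scale))) for p in pixels]


# A washed-out scan whose values sit in a narrow mid-gray band.
faded = [100, 110, 125, 130, 140]
print(stretch_contrast(faded, 0.0, 1.0))
```

Clipping a small percentage at each end (the defaults here) keeps a few stray specks from dictating the range, which is often the situation with microfilm scans.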
>>
>> (FWIW, I don't see any significant difference in OCR quality between using
>> Newspapers.com's native "save a clip to a jpg" feature and using Microsoft's
>> Snipping Tool to grab areas off the display.)
>>
>>
>> Does anyone here have any better solutions?
>>
>> Thanks,
>> Bill
>>
>>
>
> ------------------------------------------------------------
> The American Dialect Society - http://www.americandialect.org



