[Ads-l] copy and paste OCR'd text

MULLINS, WILLIAM D (Bill) CIV USARMY RDECOM AMRDEC (US) william.d.mullins18.civ at MAIL.MIL
Tue Sep 5 17:40:14 EDT 2017


Thanks, Ben.  This is news to me.  Even if it doesn't help my friend, I'm sure I'll be able to take advantage of it.
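
For a batch of links, the rewrite Ben describes below is mechanical. Here is a minimal sketch in Python, assuming every image link has the same form as his example (a numeric ID after /image/). The function name newspage_url is made up for illustration, and the code only rewrites the link; it does not fetch the page or the OCR'ed text.

```python
import re

# Assumption: image links look like https://www.newspapers.com/image/<digits>/
# as in the example from the thread; other link shapes are rejected.
_IMAGE_URL = re.compile(r"^(https?://www\.newspapers\.com)/image/(\d+)/?$")


def newspage_url(image_url: str) -> str:
    """Turn an /image/ link into the matching /newspage/ link (hypothetical helper)."""
    match = _IMAGE_URL.match(image_url.strip())
    if match is None:
        raise ValueError(f"not a recognized image link: {image_url!r}")
    base, page_id = match.groups()
    return f"{base}/newspage/{page_id}/"


if __name__ == "__main__":
    print(newspage_url("https://www.newspapers.com/image/25534415/"))
    # prints https://www.newspapers.com/newspage/25534415/
```

The text behind the resulting link would still need heavy editing in most cases, as Ben notes.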



> ----
> 
> Dunno if this helps at all, but you can get the text of a page image as Newspapers.com has OCR'ed it by using a "newspage" link. If the
> image link is, say,
> 
> https://www.newspapers.com/image/25534415/
> 
> ...then the link with OCR'ed text is:
> 
> https://www.newspapers.com/newspage/25534415/
> 
> That would obviously still require a lot of editing in most cases, but it might be the fastest way to grab text.
> 
> 
> On Tue, Sep 5, 2017 at 5:06 PM, MULLINS, WILLIAM D (Bill) CIV USARMY RDECOM AMRDEC (US) <william.d.mullins18.civ at mail.mil> wrote:
> 
> > A friend of mine is writing a book about a turn-of-the-century magician.
> > He asked me the following:
> >
> > *********************
> >
> > I could use some help converting a ton of news articles into text that
> > I can import into Pages or Word. The quality from Newspapers.com is
> > readable, but not good enough for OCR. Too many errors. Dictating or plain
> > typing takes forever. I've brought some to a few typists, but most
> > people turn down the job if it involves laborious hand-typing these days!
> >
> > Thoughts? Is there a way to strip text off Newspapers.com, etc?
> >
> > *********************
> >
> > I responded:
> >
> > *********************
> >
> > I don't have any good ideas here.  I've tried importing images into
> > Adobe Acrobat and doing OCR on them, but I've never had good results.
> > I think the problem is the contrast is low.  I can take an image into
> > GIMP or Photoshop and boost the contrast, and get better results, but
> > it's never good enough that I can simply cut and paste from the
> > resulting image.  I always end up retyping.
> >
> > A trick that works occasionally is to see if I can find the same
> > article in Newspaperarchive.com, which currently allows you to cut and
> > paste from their native PDF files.  Unfortunately, their site says they
> > are going to drop PDFs at the end of the month and display pages only
> > as JPGs.  Sometimes the Fulton database has duplicate articles, and you
> > can cut and paste from their PDFs.  (But they often have the worst
> > microfilm sources, and the OCR'd text has tons of errors.)  Both of
> > these depend on the article showing up in the database, and that only
> > happens sometimes.
> >
> > *********************
> >
> > (FWIW, I don't see any significant difference in OCR quality between
> > using Newspapers.com's native "save a clip to a jpg" feature and using
> > Microsoft's Snipping Tool to grab areas off the display.)
> >
> >
> > Does anyone here have any better solutions?
> >
> > Thanks,
> > Bill
> >
> >
> 

------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org