[Corpora-List] converting PDFs to ASCII or text-only files without clumps

John F. Sowa sowa at bestweb.net
Thu Jun 17 00:39:41 UTC 2010


On 6/16/2010 5:58 PM, maxwell wrote:
> Not for me, I have no OCR software.  This is just selecting text with a
> mouse and copying it to the clipboard.  It happens on several PDF viewers.

This is not a problem caused by the PDF viewers, but by the PDF
generators.  Some PDF sources generate clean, sequential text
with all the blanks and all the paragraphs in the right sequence.

But the PDF specification allows the items on a page to be
created in arbitrary order and placed into designated areas
of the page.  Problems are more likely to occur on pages that have
multi-column text or pictures in the middle of a page or column.
Certain kinds of generators are also notorious for creating such
scrambled PDF files.
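
To illustrate the ordering problem, here is a minimal sketch in
Python.  It assumes the text fragments and their page coordinates
have already been extracted somehow; the Fragment type, the sample
data, and the line_tolerance parameter are all hypothetical and
not taken from any particular PDF library.  It only shows why
emission order and reading order can differ on a single-column
page.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A piece of text with its position on the page (hypothetical type)."""
    x: float    # left edge, in points
    y: float    # baseline, in points, measured up from the bottom of the page
    text: str


# Made-up fragments, listed in the order a generator might emit them.
emitted = [
    Fragment(72, 680, "second line"),
    Fragment(150, 700, "line"),
    Fragment(72, 700, "first"),
]


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by baseline, then order each line left to right."""
    # Higher y means nearer the top of the page, so sort by descending y.
    ordered = sorted(fragments, key=lambda f: (-f.y, f.x))
    lines = []
    for frag in ordered:
        # Fragments whose baselines nearly coincide belong to the same line.
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [" ".join(f.text for f in sorted(line, key=lambda f: f.x))
            for line in lines]


# What a naive copy in emission order would give.
emission_text = " ".join(f.text for f in emitted)
print(emission_text)

# The same fragments after sorting by position.
print(reading_order(emitted))
```

Sorting by position like this is only adequate for simple
single-column pages.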

Unless you have very smart PDF-to-text software, the result can
be horribly scrambled.  If you don't have such software (and such
software is mostly experimental), it may be simpler to generate
a print image and use an OCR system to convert it back to text.
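
As a rough idea of what "smart" has to mean here, the following
Python sketch shows one heuristic for a two-column page.  It is
an assumption-laden illustration, not how any real converter
works: the fragments are made-up (x, y, text) tuples, the page is
assumed to have exactly two columns, and the gutter is guessed
from the largest gap between left edges.

```python
# Made-up fragments as (x, y, text); y is measured up from the page bottom.
page = [
    (72, 700, "Left column, line 1"),
    (320, 700, "Right column, line 1"),
    (72, 686, "Left column, line 2"),
    (320, 686, "Right column, line 2"),
]


def naive_order(fragments):
    """Top to bottom, then left to right: interleaves the two columns."""
    return [text for _, _, text in sorted(fragments, key=lambda f: (-f[1], f[0]))]


def split_columns(fragments):
    """Split into two columns at the largest gap between distinct left edges.

    Assumes the page really has two columns; a single-column page with
    indented lines would be split wrongly.
    """
    xs = sorted({f[0] for f in fragments})
    if len(xs) < 2:
        return [list(fragments)]
    # Each entry is (gap width, midpoint of the gap); the widest gap wins.
    gaps = [(b - a, (a + b) / 2) for a, b in zip(xs, xs[1:])]
    _, gutter = max(gaps)
    left = [f for f in fragments if f[0] < gutter]
    right = [f for f in fragments if f[0] >= gutter]
    return [left, right]


def column_order(fragments):
    """Read each column top to bottom, left column first."""
    result = []
    for column in split_columns(fragments):
        result.extend(text for _, _, text in sorted(column, key=lambda f: -f[1]))
    return result


print(naive_order(page))
print(column_order(page))
```

Pictures, tables, and headings that span both columns would
defeat this heuristic, which is part of why the OCR route can be
simpler.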

John Sowa

PS:  Disclaimer:  I am not an expert on PDF techniques, but I
have seen such problems, and I have talked with people who are
familiar with them.  So please don't ask me for a solution.



_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


