[Corpora-List] PDF Conversion

Wed Mar 29 11:54:49 UTC 2006

Hello everybody,
there is a commercial OCR package, FineReader, which can read a PDF file
directly, whether its pages are text or bitmaps (so you do not need to
convert them into bitmaps first). It is not very effective on very complex
layouts (e.g. a tabloid) but otherwise performs quite well. It tries hard to
reproduce the formatting of the PDF pages, including headers and the like.
Best wishes,
Tadeusz Piotrowski

> -----Original Message-----
> From: owner-corpora at lists.uib.no 
> [mailto:owner-corpora at lists.uib.no] On Behalf Of Brett Powley
> Sent: Wednesday, March 29, 2006 5:12 AM
> To: Ken Litkowski
> Cc: corpora at hd.uib.no
> Subject: Re: [Corpora-List] PDF Conversion
> 
> Hi Ken,
> 
> The work I have been doing (with the ACL anthology) involves 
> doing precisely this.  I spent some time evaluating tools to 
> do it, including:
> 
> Adobe Reader (using Save as Text...)
> Multivalent (Java, open source)
> PDFBox (Java, open source)
> XPDF (open source)
> Etymon Pjx (open source)
> PDFTextStream (commercial)
> JPedal (commercial)
> Argus (commercial)
> 3-heights PDF extract (pdf-tools) (commercial)
> 
> (I also looked to see whether Mac OS X provided any API for 
> text extraction since it has built-in PDF support and it 
> indexes PDF documents, but if there is an API it's not a 
> public one yet.)
> 
> The one that gave the best performance was PDFBox (open 
> source, Java), but among the ones that performed well, there 
> really wasn't much in it.
> 
> There are two major issues in PDF extraction:
> 
> (1) Page layout -- footnotes, columns, etc. PDF is (was)
> designed to provide an accurate on-screen or printed
> rendering of a document (it's essentially a special version
> of PostScript), so getting the text back out wasn't an issue,
> for the original designers at least. This means in theory
> that the text can appear in the file in any arbitrary order
> (the order in which it's drawn), though in practice it tends
> to be in a somewhat sensible order -- the text tends to be in
> order, and columns tend to be OK too. Footnotes, headers, and
> footers, however, are a more difficult problem.
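
The drawing-order problem in (1) can be made concrete with a small sketch.
It assumes an extractor has already returned positioned text fragments for
a two-column page. The Fragment type, the reading_order function, and the
fixed column_split coordinate are hypothetical names for illustration; they
are not part of PDFBox or any other tool listed above.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x: float  # left edge of the fragment, in points
    y: float  # distance from the top of the page, in points


def reading_order(fragments, column_split, line_tolerance=2.0):
    """Return fragment texts sorted left column first, then top to bottom.

    column_split is an assumed x coordinate separating the two columns.
    """
    def key(fragment):
        column = 0 if fragment.x < column_split else 1
        # Bucket y so fragments on nearly the same baseline sort by x.
        line = round(fragment.y / line_tolerance)
        return (column, line, fragment.x)

    return [f.text for f in sorted(fragments, key=key)]


# Fragments in drawing order, which need not match reading order.
page = [
    Fragment("column two, line one", 320, 100),
    Fragment("column one, line two", 50, 114),
    Fragment("column one, line one", 50, 100),
    Fragment("column two, line two", 320, 114),
]

for line in reading_order(page, column_split=300):
    print(line)
```

Bucketing by y is crude (two values close to a bucket boundary can land on
different lines), and nothing here separates footnotes, headers, or footers
from body text, which is the harder part of the problem.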
> 
> (2) Font encoding -- when a PDF document uses an embedded 
> font subset, the mapping between the character codes used for 
> characters and what characters they represent is generally 
> unknown.  The document essentially looks like "draw character 
> X here" where X points to the glyph which should be drawn, 
> but is otherwise arbitrary.
> All of the tools above failed on documents with embedded 
> fonts in the same way.  For the ACL anthology, this seemed to 
> affect about 40% of the documents.
> One solution to this problem (albeit not a very elegant one) 
> is to render the affected PDF documents as bitmaps, 
> and then run OCR on them.
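
Before falling back to OCR, it helps to know which documents were affected
by the encoding problem in (2). The following is only a rough heuristic
sketch, assuming English text: it flags extracted text in which very few
tokens are common function words. The looks_garbled function, the COMMON
word list, and the min_share threshold are all assumptions made for
illustration, not something any of the tools above provide.

```python
# A small, hand-picked set of frequent English function words (assumption).
COMMON = {
    "the", "of", "and", "to", "a", "in", "is", "that", "for", "it",
    "we", "this", "with", "as", "on", "are", "be", "by",
}


def looks_garbled(text, min_share=0.1):
    """Return True when too few tokens are common English function words."""
    tokens = [t.strip(".,;:!?()\"'").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        # Nothing extracted at all is treated as a failure too.
        return True
    share = sum(t in COMMON for t in tokens) / len(tokens)
    return share < min_share


good = "The text tends to be in order, and columns tend to be fine."
# Same kind of sentence with every letter remapped, as an unknown
# font-subset encoding might produce.
bad = "Wkh whaw whqgv wr eh lq rughu"

print(looks_garbled(good))
print(looks_garbled(bad))
```

Documents flagged this way would be candidates for the render-and-OCR route;
very short texts or non-English documents would need a different test.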
> 
> Hope this helps,
> 
> Brett
> 
> 
> On 29/03/2006, at 2:35 AM, Ken Litkowski wrote:
> 
> > Is anyone aware of free software that will process PDF documents into
> > text streams?  There is a PDF2HTML (with an XML option) that will
> > create page-centric versions, but this does not really distinguish
> > text from format.  I want to ignore (or be able to treat separately)
> > such things as headers, footnotes, tables, figures, and equations.
> > (Note that even Google retains the page-centric view.)
> >
> > Thanks,
> > 	Ken
> > -- 
> > Ken Litkowski                     TEL.: 301-482-0237
> > CL Research                       EMAIL: ken at clres.com
> > 9208 Gue Road
> > Damascus, MD 20872-1025 USA       Home Page: http://www.clres.com
> 
> 
> 
> --------------------------------------------------------------
> Brett Powley -- PhD Candidate
> Centre for Language Technology, Macquarie University,  Australia
> p: +61-402-013050    f: +61-2-90120813    e: bpowley at ics.mq.edu.au
> w: http://www.ics.mq.edu.au/~bpowley
> faciendi plures libros nullus est finis
> frequensque meditatio carnis adflictio est
> --------------------------------------------------------------