[Corpora-List] PDF Conversion

Wed Mar 29 03:11:33 UTC 2006

Hi Ken,

My work with the ACL Anthology involves precisely this.  I spent some
time evaluating tools for it, including:

Adobe Reader (using Save as Text...)
Multivalent (Java, open source)
PDFBox (Java, open source)
XPDF (open source)
Etymon Pjx (open source)
PDFTextStream (commercial)
JPedal (commercial)
Argus (commercial)
3-heights PDF extract (pdf-tools) (commercial)

(I also looked to see whether Mac OS X provided any API for text  
extraction since it has built-in PDF support and it indexes PDF  
documents, but if there is an API it's not a public one yet.)

The one that performed best was PDFBox (open source, Java), but among
the tools that performed well there was little to separate them.

There are two major issues in PDF extraction:

(1) Page layout -- footnotes, columns, etc. PDF was designed to
provide an accurate on-screen or printed rendering of a document
(it's essentially a special version of PostScript), so getting the
text back out wasn't a concern for the original designers.  This
means that, in theory, the text can appear in the file in any order
(the order in which it's drawn), though in practice it tends to be in
a fairly sensible order: body text is usually in reading order, and
columns tend to be OK too.  Footnotes, headers, and footers, however,
are a more difficult problem.
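
As an illustration of one way to attack the header/footer part of
this, here is a minimal Python sketch.  It is not what any of the
tools above do; it assumes you already have each page as a list of
text lines (the `pages` input, the function names and the thresholds
are all my own hypothetical choices) and it simply drops lines near
the top or bottom of a page that recur on most pages, after masking
digits so that page numbers compare equal.

```python
import re
from collections import Counter


def normalise(line):
    """Collapse whitespace and mask digit runs so '12' and '13' compare equal."""
    return re.sub(r"\d+", "#", " ".join(line.split())).lower()


def strip_running_lines(pages, edge=2, min_share=0.6):
    """Drop lines near the top or bottom of a page that recur on most pages.

    pages     -- list of pages, each a list of text lines (assumed input)
    edge      -- how many lines at each end of a page to inspect
    min_share -- fraction of pages a line must appear on to be dropped
    """
    counts = Counter()
    for lines in pages:
        candidates = lines[:edge] + lines[-edge:]
        # Count each normalised line at most once per page.
        counts.update({normalise(l) for l in candidates if l.strip()})

    threshold = max(2, min_share * len(pages))
    running = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        kept = []
        for i, line in enumerate(lines):
            at_edge = i < edge or i >= len(lines) - edge
            if at_edge and normalise(line) in running:
                continue  # looks like a running header, footer or page number
            kept.append(line)
        cleaned.append(kept)
    return cleaned
```

This only catches text that repeats from page to page.  Footnotes
differ on every page, so they would need something else (position or
font size, if the extraction tool exposes them).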

(2) Font encoding -- when a PDF document uses an embedded font
subset, the mapping between the character codes in the text and the
characters they represent is generally unknown.  The document
essentially says "draw glyph X here", where X identifies the glyph
to be drawn but is otherwise arbitrary.

All of the tools above failed in the same way on documents with such
embedded fonts.  For the ACL Anthology, this seemed to affect about
40% of the documents.

One solution to this problem (albeit not a very elegant one) is to
render the affected PDF documents as bitmaps, and then run OCR on
them.
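
To decide which documents need the OCR route, something like the
following rough Python heuristic might do.  It is only a sketch under
a strong assumption (the documents are mostly English prose), and the
function name and threshold are hypothetical: it flags extracted text
in which too few tokens look like words, which is what arbitrary
glyph codes tend to produce.

```python
import re

# A token is "plausible" if it has a run of two or more letters
# and contains at least one vowel somewhere.
WORD = re.compile(r"[A-Za-z]{2,}")
VOWEL = re.compile(r"[aeiouyAEIOUY]")


def looks_garbled(text, min_word_share=0.5):
    """Guess whether extracted text is mis-decoded glyph codes, not prose.

    Assumes mostly English text; min_word_share is an untuned guess.
    """
    tokens = text.split()
    if not tokens:
        return True  # nothing was extracted at all: treat as a failure
    plausible = sum(1 for t in tokens if WORD.search(t) and VOWEL.search(t))
    return plausible / len(tokens) < min_word_share
```

Documents for which this returns True would be the candidates for
rendering and OCR; the rest can keep the directly extracted text.
The threshold would need tuning on a sample of real documents.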

Hope this helps,

Brett

On 29/03/2006, at 2:35 AM, Ken Litkowski wrote:

> Is anyone aware of free software that will process PDF documents  
> into text streams?  There is a PDF2HTML (with an XML option) that  
> will create page-centric versions, but this does not really  
> distinguish text from format.  I want to ignore (or be able to  
> treat separately) such things as headers, footnotes, tables,  
> figures, and equations.  (Note that even Google retains the
> page-centric view.)
>
> Thanks,
> 	Ken
> -- 
> Ken Litkowski                     TEL.: 301-482-0237
> CL Research                       EMAIL: ken at clres.com
> 9208 Gue Road
> Damascus, MD 20872-1025 USA       Home Page: http://www.clres.com

--------------------------------------------------------------
Brett Powley -- PhD Candidate
Centre for Language Technology, Macquarie University,  Australia
p: +61-402-013050    f: +61-2-90120813    e: bpowley at ics.mq.edu.au
w: http://www.ics.mq.edu.au/~bpowley
faciendi plures libros nullus est finis
frequensque meditatio carnis adflictio est
--------------------------------------------------------------