[Corpora-List] PDF Conversion

Tom Emerson tree at basistech.com
Tue Mar 28 15:42:20 UTC 2006


Ken Litkowski writes:
> Is anyone aware of free software that will process PDF documents into 
> text streams?  There is a PDF2HTML (with an XML option) that will create 
> page-centric versions, but this does not really distinguish text from 
> format.  I want to ignore (or be able to treat separately) such things 
> as headers, footnotes, tables, figures, and equations.  (Note that even 
> Google retains the page-centric view.)

PDF is a page-centric format, so you are unlikely to find something
that does what you are looking for: headers, footnotes, tables, etc.
are not marked as distinct from the surrounding content in any
special way.
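
Since the format itself carries no such markers, any separation has to
be heuristic. As a rough illustration only, here is a minimal Python
sketch that assumes the text has already been extracted one string per
page by some other tool. The function name strip_repeated_edges and the
parameters edge_lines and min_share are hypothetical, not part of any
existing library. It treats a line as a running header or footer if it
appears at the top or bottom of most pages:

```python
from collections import Counter


def strip_repeated_edges(pages, edge_lines=1, min_share=0.6):
    """Drop lines that repeat at the top/bottom of most pages.

    pages      -- list of strings, one per page (assumed already extracted)
    edge_lines -- how many lines at each page edge to inspect (assumption)
    min_share  -- fraction of pages a line must appear on (assumption)
    Returns (cleaned_pages, repeated_lines).
    """
    # Normalise each page to a list of non-empty, stripped lines.
    split = [[ln.strip() for ln in p.splitlines() if ln.strip()] for p in pages]

    # Count, per page, which lines occur in an edge position.
    counts = Counter()
    for lines in split:
        counts.update(set(lines[:edge_lines] + lines[-edge_lines:]))

    threshold = min_share * len(split)
    repeated = {ln for ln, n in counts.items() if n >= threshold}

    # Remove a line only when it is both repeated and at a page edge.
    cleaned = []
    for lines in split:
        last = len(lines) - edge_lines
        kept = [
            ln
            for i, ln in enumerate(lines)
            if not (ln in repeated and (i < edge_lines or i >= last))
        ]
        cleaned.append("\n".join(kept))
    return cleaned, repeated


pages = [
    "Journal of X\nFirst body line.\nMore text.\n1",
    "Journal of X\nSecond page body.\n2",
    "Journal of X\nThird page body.\n3",
]
body, repeated = strip_repeated_edges(pages)
```

Note the limits: page numbers differ from page to page, so this sketch
leaves them in place, and it does nothing for footnotes, tables,
figures, or equations, which would need layout information (positions,
font sizes) that plain extracted text no longer has.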

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
 "You can't fake quality any more than you can fake a good meal." (W.S.B.)


