[Corpora-List] PDF Conversion
Tom Emerson
tree at basistech.com
Tue Mar 28 15:42:20 UTC 2006
Ken Litkowski writes:
> Is anyone aware of free software that will process PDF documents into
> text streams? There is a PDF2HTML (with an XML option) that will create
> page-centric versions, but this does not really distinguish text from
> format. I want to ignore (or be able to treat separately) such things
> as headers, footnotes, tables, figures, and equations. (Note that even
> Google retains the page-centric view.)
PDF is a page-centric format, so you are unlikely to find something
that does what you are looking for: headers, footnotes, tables, etc.
are not going to be distinguished from the surrounding content in any
special way.
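
Since the format itself carries no such markers, any separation has to
be heuristic. As a rough sketch only: assuming some tool has already
given you the text of each page as a list of lines, you could drop lines
that recur at the top or bottom of most pages (running headers, page
numbers). The names normalize and strip_repeated_edges and the 0.6
threshold are made up for illustration, and this does nothing for
footnotes, tables, figures, or equations.

```python
import re
from collections import Counter


def normalize(line):
    # Collapse digit runs so "Page 3" and "Page 4" compare as equal.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_edges(pages, edge=1, min_share=0.6):
    """Drop lines that repeat at the top or bottom of most pages.

    pages     -- list of pages, each a list of text lines (assumed input)
    edge      -- how many lines at each end of a page to inspect
    min_share -- fraction of pages a line must recur on to be dropped
    """
    counts = Counter()
    for lines in pages:
        # Count each candidate at most once per page.
        candidates = lines[:edge] + lines[-edge:]
        counts.update({normalize(line) for line in candidates})

    threshold = min_share * len(pages)
    repeated = {key for key, n in counts.items() if n >= threshold}

    body = []
    for lines in pages:
        last = len(lines) - edge
        for i, line in enumerate(lines):
            at_edge = i < edge or i >= last
            if at_edge and normalize(line) in repeated:
                continue  # likely a running header, footer, or page number
            body.append(line)
    return body
```

The threshold is a guess and would need tuning per collection; a
document whose headers change with every chapter will slip through.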
--
Tom Emerson Basis Technology Corp.
Software Architect http://www.basistech.com
"You can't fake quality any more than you can fake a good meal." (W.S.B.)