[Corpora-List] PDF Conversion

Alexander Osherenko osherenko at gmx.de
Tue Mar 28 15:57:26 UTC 2006


Hi Ken,

I have worked with the PDF2HTML tool, and my experience is that although
it is free software, you still pay by losing your time and temper :) - the
tool produces imprecise results (HTML tags or footnotes in the wrong
order, and incorrectly nested HTML tags, e.g. <b><i></b></i>, to name one).
Nevertheless, once you have finished your first experiments with the tool,
you may find that you have become quite an expert in PDF, HTML, PDF2HTML,
or whatever, and that the tool is actually not so bad...

Sorry if my answer is somewhat confusing, but I hope it helps.

Cheers

Alexander

Ken Litkowski wrote:

> Is anyone aware of free software that will process PDF documents into 
> text streams?  There is a PDF2HTML (with an XML option) that will 
> create page-centric versions, but this does not really distinguish 
> text from format.  I want to ignore (or be able to treat separately) 
> such things as headers, footnotes, tables, figures, and equations.  
> (Note that even Google retains the page-centric view.)
>
> Thanks,
>     Ken


