[Corpora-List] PDF Conversion

Kristofer Franzén franzen at sics.se
Tue Mar 28 17:06:09 UTC 2006


I recently evaluated both commercial and free software for PDF-to-text 
conversion, and I've come to the depressing conclusion that there is 
really nothing better than the "Save as Text..." function in Adobe 
Reader (6.0).

That said, I don't think I am familiar with the converter from Ted 
Briscoe's group that Hamish Cunningham mentioned in a reply to your post.

In my experience, no available tool can handle (1) the separation of 
figure and table captions from the running text, (2) unusual characters 
and symbols (Greek, math), and (3) the different ways in which PDF 
files are encoded.
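
To make point 1 concrete, here is a rough post-processing sketch in Python 
that runs over text a converter has already extracted and sets aside lines 
that look like captions. It is only a heuristic under my own assumptions: 
the caption pattern and the name split_captions are hypothetical, and it 
treats every caption as a single line, which real converter output often 
violates.

```python
import re

# Assumed caption pattern: a line that starts with "Figure", "Fig." or
# "Table", then a number, then optional "." or ":" and whitespace.
# Real documents vary widely, so treat this as a starting point only.
CAPTION_RE = re.compile(r"^\s*(?:Figure|Fig\.|Table)\s+\d+[.:]?\s", re.IGNORECASE)


def split_captions(text):
    """Split extracted text into (running_text_lines, caption_lines)."""
    body, captions = [], []
    for line in text.splitlines():
        if CAPTION_RE.match(line):
            captions.append(line.strip())
        else:
            body.append(line)
    return body, captions


sample = (
    "Results are shown below.\n"
    "Figure 1: Accuracy per corpus.\n"
    "The trend is clear."
)
body, captions = split_captions(sample)
```

A sentence of running text that happens to begin a line with "Figure 1 
shows" would be misclassified, and multi-line captions would be split, 
which is part of why I consider the problem unsolved.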

Best,

Kristofer Franzén



Ken Litkowski wrote:

> Is anyone aware of free software that will process PDF documents into 
> text streams?  There is a PDF2HTML (with an XML option) that will 
> create page-centric versions, but this does not really distinguish 
> text from format.  I want to ignore (or be able to treat separately) 
> such things as headers, footnotes, tables, figures, and equations.  
> (Note that even Google retains the page-centric view.)
>
> Thanks,
>     Ken


