[Corpora-List] PDF Conversion
Kristofer Franzén
franzen at sics.se
Tue Mar 28 17:06:09 UTC 2006
Recently, I've tried to evaluate both commercial and free software for
PDF-to-text conversion, and I've come to the depressing conclusion that
there is really nothing better available than the "Save as Text..."
function in Adobe Reader (6.0).
That said, I don't think I am familiar with the converter by Ted Briscoe's
group, mentioned by Hamish Cunningham in a reply to your post.
My experience is that you cannot find a tool that can handle:
1. the separation of figure and table captions from the running text
2. unusual characters and symbols (Greek, math)
3. the different ways of encoding PDF.
Best,
Kristofer Franzén
Ken Litkowski wrote:
> Is anyone aware of free software that will process PDF documents into
> text streams? There is a PDF2HTML (with an XML option) that will
> create page-centric versions, but this does not really distinguish
> text from format. I want to ignore (or be able to treat separately)
> such things as headers, footnotes, tables, figures, and equations.
> (Note that even Google retains the page-centric view.)
>
> Thanks,
> Ken