[Corpora-List] PDF Conversion

Tue Mar 28 17:09:58 UTC 2006

Hi,

> Is anyone aware of free software that will process PDF documents into 
> text streams?  There is a PDF2HTML (with an XML option) that will create 
> page-centric versions, but this does not really distinguish text from 
> format.  I want to ignore (or be able to treat separately) such things 
> as headers, footnotes, tables, figures, and equations.  (Note that even 
> Google retains the page-centric view.)
There was a thread on the Corpora list in 2001 about converting PDF files.
Here is the link to it:
http://torvald.aksis.uib.no/corpora/2001-2/0133.html
and a summary of the answers:
http://torvald.aksis.uib.no/corpora/2001-4/0257.html

However, I doubt any of these programs will solve your problem. All the
programs I have used break the text into pages. In some cases you
can write post-processors to identify footnotes and similar elements,
but very often these are formatting-dependent (i.e. they will work
well only on documents from the same source, e.g. journal articles from
a single publisher).

Regards,

Constantin

-- 
Constantin Orasan <C.Orasan at wlv.ac.uk>
http://www.wlv.ac.uk/~in6093/
Lecturer in Computational Linguistics
Research Group in Computational Linguistics
University of Wolverhampton