[Corpora-List] PDF Conversion
Constantin Orasan
C.Orasan at wlv.ac.uk
Tue Mar 28 17:09:58 UTC 2006
Hi,
> Is anyone aware of free software that will process PDF documents into
> text streams? There is a PDF2HTML (with an XML option) that will create
> page-centric versions, but this does not really distinguish text from
> format. I want to ignore (or be able to treat separately) such things
> as headers, footnotes, tables, figures, and equations. (Note that even
> Google retains the page-centric view.)
There was a thread on the Corpora list about converting PDF files in 2001.
Here are the links:
http://torvald.aksis.uib.no/corpora/2001-2/0133.html
and a summary of the answers:
http://torvald.aksis.uib.no/corpora/2001-4/0257.html
However, I doubt any of these programs will solve your problem. All the
programs I have used break the text into pages. In some cases you
can write post-processors to identify footnotes and similar
elements, but very often they are format-dependent (i.e. they will work
well only on documents from the same source, e.g. journal articles from a
single publisher).
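
As a rough illustration of the kind of post-processor I mean, here is a
minimal Python sketch. It assumes the converter marks page breaks with a
form feed character (pdftotext does this by default) and that running
headers and footers are the first and last line of each page. The
function name strip_running_lines and the min_share threshold are made
up for this example, and the heuristic will need tuning for each
document source.

```python
import re
from collections import Counter


def strip_running_lines(text, min_share=0.6):
    """Drop first/last page lines that repeat across most pages.

    Assumption: pages are separated by form feed characters ("\f").
    """
    pages = [p.strip("\n").split("\n") for p in text.split("\f") if p.strip()]

    def key(line):
        # Replace digit runs so that "Page 3" and "Page 4" compare equal.
        return re.sub(r"\d+", "#", line.strip())

    first = Counter(key(p[0]) for p in pages)
    last = Counter(key(p[-1]) for p in pages)
    # A line must recur on at least two pages to count as a header/footer.
    threshold = max(2, int(len(pages) * min_share))

    cleaned = []
    for p in pages:
        body = list(p)
        if body and first[key(body[0])] >= threshold:
            body = body[1:]   # drop the running header
        if body and last[key(body[-1])] >= threshold:
            body = body[:-1]  # drop the running footer or page number
        cleaned.append("\n".join(body))
    return "\n".join(cleaned)
```

This only handles one-line headers and footers. Footnotes, tables and
equations need their own rules, which is where the dependence on the
formatting of a particular source comes in.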
Regards,
Constantin
--
Constantin Orasan <C.Orasan at wlv.ac.uk>
http://www.wlv.ac.uk/~in6093/
Lecturer in Computational Linguistics
Research Group in Computational Linguistics
University of Wolverhampton