[Corpora-List] PDF Conversion
Victor Kapustin
victor.kapustin at gmail.com
Wed Mar 29 14:34:57 UTC 2006
Ken,
> Is anyone aware of free software that will process PDF documents into
> text streams? There is a PDF2HTML (with an XML option) that will
> create page-centric versions, but this does not really distinguish
> text from format. I want to ignore (or be able to treat separately)
> such things as headers, footnotes, tables, figures, and equations.
> (Note that even Google retains the page-centric view.)
gsview: http://www.cs.wisc.edu/~ghost/gsview/index.htm - includes pstotext
For batch processing: pstotext - extracting plain text from PostScript:
http://www.cs.wisc.edu/~ghost/doc/pstotext.htm
Both require GhostScript (http://www.cs.wisc.edu/~ghost/doc/AFPL/get853.htm)
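
A minimal sketch of the batch use mentioned above, in Python. It assumes that pstotext is on the PATH, that it accepts a PostScript or PDF file as its only argument, and that it prints the extracted text to standard output. The function names (build_command, run_pstotext, convert_all), the Latin-1 decoding and the .txt naming are illustrative choices, not part of pstotext.

```python
import subprocess
from pathlib import Path


def build_command(source):
    # Assumption: `pstotext FILE` prints the extracted text to stdout.
    return ["pstotext", str(source)]


def run_pstotext(source):
    # Runs the real tool; raises CalledProcessError on a non-zero exit status.
    result = subprocess.run(build_command(source), capture_output=True, check=True)
    # Assumption: the output is Latin-1; adjust if your build differs.
    return result.stdout.decode("latin-1")


def convert_all(directory, extract=run_pstotext):
    """Write a sibling .txt file for every .pdf and .ps file in `directory`."""
    written = []
    for source in sorted(Path(directory).iterdir()):
        if source.suffix.lower() not in {".pdf", ".ps"}:
            continue
        target = source.with_suffix(".txt")
        target.write_text(extract(source), encoding="utf-8")
        written.append(target)
    return written
```

Calling convert_all("papers") would then leave one .txt file next to each input file. The extract parameter is there only so the loop can be tried out without GhostScript installed.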
For me they do a good job, though equations (and text fragments such as plot
axis marks) pollute the extracted text.
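
One rough way to reduce that pollution afterwards is to drop lines that consist mostly of digits and symbols. The sketch below is only a heuristic of my own, not something pstotext does; the names looks_like_prose and drop_noise and the 0.6 threshold are arbitrary assumptions, and it will also discard legitimate lines that are mostly numbers.

```python
def looks_like_prose(line, min_alpha_ratio=0.6):
    # Blank lines are kept so that paragraph breaks survive.
    visible = [ch for ch in line if not ch.isspace()]
    if not visible:
        return True
    letters = sum(ch.isalpha() for ch in visible)
    return letters / len(visible) >= min_alpha_ratio


def drop_noise(text, min_alpha_ratio=0.6):
    # Keep only lines whose share of alphabetic characters reaches the threshold.
    kept = [ln for ln in text.splitlines() if looks_like_prose(ln, min_alpha_ratio)]
    return "\n".join(kept)
```

Axis tick labels such as "0 0.2 0.4 0.6" and short formula lines such as "x = 2 + 3" fall below the threshold and are removed, while ordinary sentences pass through unchanged.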
--
Victor Kapustin
Saint-Petersburg State Univ.
Russia