[Corpora-List] PDF Conversion

Victor Kapustin victor.kapustin at gmail.com
Wed Mar 29 14:34:57 UTC 2006


Ken, 

> Is anyone aware of free software that will process PDF documents into 
> text streams?  There is a PDF2HTML (with an XML option) that will 
> create page-centric versions, but this does not really distinguish 
> text from format.  I want to ignore (or be able to treat separately) 
> such things as headers, footnotes, tables, figures, and equations.  
> (Note that even Google retains the page-centric view.)

GSview includes pstotext: http://www.cs.wisc.edu/~ghost/gsview/index.htm

For batch processing: pstotext - extracting plain text from PostScript:
http://www.cs.wisc.edu/~ghost/doc/pstotext.htm

Both require Ghostscript (http://www.cs.wisc.edu/~ghost/doc/AFPL/get853.htm).
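
A batch run over a directory could be scripted roughly as follows. This is
only a sketch: it assumes pstotext is on the PATH, takes the input file as
its only argument, and writes the extracted text to standard output. The
function names and the Latin-1 decoding are my own choices for illustration,
not part of pstotext.

```python
import subprocess
from pathlib import Path


def extract_text(pdf_path, run=subprocess.run, exe="pstotext"):
    """Run pstotext on one file and return what it wrote to stdout."""
    # Assumption: `pstotext FILE` prints the extracted text on stdout.
    result = run([exe, str(pdf_path)], capture_output=True, check=True)
    # Assumption: the output is Latin-1; adjust if your files differ.
    return result.stdout.decode("latin-1")


def convert_directory(src_dir, dst_dir, run=subprocess.run):
    """Write one .txt file into dst_dir for every .pdf file in src_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        out = dst / (pdf.stem + ".txt")
        out.write_text(extract_text(pdf, run=run), encoding="utf-8")
        written.append(out)
    return written
```

Calling convert_directory("papers", "papers_txt") would then leave one text
file per PDF in papers_txt (both directory names are placeholders). The run
parameter is there only so the loop can be tried without pstotext installed.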

For me they do a good job, though equations (and text fragments such as plot
axis tick marks) pollute the extracted text.
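
One crude way to reduce that pollution is to drop lines that do not look
like running text. The filter below is a heuristic of my own, not something
pstotext or GSview provides, and the thresholds are guesses that would need
tuning on real output.

```python
def looks_like_prose(line, min_alpha_ratio=0.6, min_words=3):
    """Guess whether a line is running text rather than equation or axis debris."""
    stripped = line.strip()
    if not stripped:
        return True  # keep blank lines so paragraph breaks survive
    words = stripped.split()
    non_space = [c for c in stripped if not c.isspace()]
    letters = sum(c.isalpha() for c in non_space)
    # Hypothetical thresholds: enough words, and mostly letters.
    return len(words) >= min_words and letters / len(non_space) >= min_alpha_ratio


def drop_debris(text):
    """Return text with the lines that fail looks_like_prose removed."""
    return "\n".join(l for l in text.splitlines() if looks_like_prose(l))
```

Note that this also discards short headings and one- or two-word lines, so
it suits cases where only the body text matters.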

--
Victor Kapustin
Saint-Petersburg State Univ.
Russia


