[Corpora-List] PDF Conversion

Hamish Cunningham hamish at dcs.shef.ac.uk
Tue Mar 28 15:50:16 UTC 2006


Ted Briscoe's group in Cambridge have a PDF converter; you might contact
them.

Best

Hamish


Tom Emerson wrote:
> Ken Litkowski writes:
> 
>>Is anyone aware of free software that will process PDF documents into 
>>text streams?  There is a PDF2HTML (with an XML option) that will create 
>>page-centric versions, but this does not really distinguish text from 
>>format.  I want to ignore (or be able to treat separately) such things 
>>as headers, footnotes, tables, figures, and equations.  (Note that even 
>>Google retains the page-centric view.)
> 
> 
> PDF is a page-centric format, so you are unlikely to find
> something that does what you are looking for: headers, footnotes,
> tables, etc. are not going to be distinguished from the surrounding
> content in any special way.
> 

-- 
Hamish
http://www.dcs.shef.ac.uk/~hamish/


