[Corpora-List] PDF Conversion

Wed Mar 29 14:16:46 UTC 2006

Hi,

We tested the MultiValent tool for extracting text from PDF files and found that it works quite well.

To identify figures, tables, etc., you need to add a post-processor that uses heuristic algorithms. We tried some algorithms for tables and figures and got reasonably good results.
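
A minimal sketch of what such a post-processor might look like, assuming the extracted text arrives as one string per layout line. The function names, regular expressions, and the three-column threshold are illustrative assumptions, not the algorithms referred to above and not part of MultiValent. It separates likely table rows and figure/table captions from running text:

```python
import re

# Assumption: captions start with "Figure N", "Fig. N" or "Table N".
CAPTION_RE = re.compile(r"^\s*(figure|fig\.|table)\s+\d+", re.IGNORECASE)
# Assumption: table cells are separated by a tab or by two or more spaces.
COLUMN_GAP_RE = re.compile(r"\s{2,}|\t")


def looks_like_table_row(line, min_columns=3):
    """Return True if the line splits into at least min_columns cells."""
    cells = [c for c in COLUMN_GAP_RE.split(line.strip()) if c]
    return len(cells) >= min_columns


def split_body_and_floats(lines):
    """Separate running text from likely captions and table rows."""
    body, floats = [], []
    for line in lines:
        if CAPTION_RE.match(line) or looks_like_table_row(line):
            floats.append(line)
        else:
            body.append(line)
    return body, floats


if __name__ == "__main__":
    sample = [
        "This is running text about corpora.",
        "Table 1: Token counts",
        "Corpus    Tokens    Types",
        "More running text.",
    ]
    body, floats = split_body_and_floats(sample)
    print("BODY:", body)
    print("FLOATS:", floats)
```

Heuristics of this kind depend heavily on how the extractor preserves whitespace, so the patterns would likely need tuning per document collection.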

Scott Piao
-------------------
Computing Department
Lancaster University
Lancaster LA1 4WA
UK

-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On Behalf Of Ken Litkowski
Sent: 28 March 2006 16:35
To: corpora at hd.uib.no
Subject: [Corpora-List] PDF Conversion

Is anyone aware of free software that will process PDF documents into text streams?  There is a PDF2HTML (with an XML option) that will create page-centric versions, but this does not really distinguish text from format.  I want to ignore (or be able to treat separately) such things as headers, footnotes, tables, figures, and equations.  (Note that even Google retains the page-centric view.)

Thanks,
	Ken
-- 
Ken Litkowski                     TEL.: 301-482-0237
CL Research                       EMAIL: ken at clres.com
9208 Gue Road
Damascus, MD 20872-1025 USA       Home Page: http://www.clres.com