[Corpora-List] PDF Conversion
Piao, Songlin
s.piao at lancaster.ac.uk
Wed Mar 29 14:16:46 UTC 2006
Hi,
We tested the MultiValent tool for extracting text from PDF files and found that it works quite well.
To identify figures, tables, etc., you need to add a post-processor that applies some heuristic algorithms. We tried a few such algorithms for tables and figures and got reasonably good results.
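A minimal sketch of what such a heuristic post-processor might look like, in Python, is given here. It is not the algorithm used in the tests mentioned above, only an illustration under stated assumptions: the input is plain text already extracted from the PDF, one string per line; captions start with "Figure N", "Fig. N" or "Table N"; and table rows show up as three or more cells separated by tabs or runs of two or more spaces. The names classify_line and split_body and the three-cell threshold are hypothetical.

```python
import re

# Assumption: captions begin with "Figure 3", "Fig. 3" or "Table 3".
CAPTION_RE = re.compile(r"^\s*(figure|fig\.|table)\s+\d+", re.IGNORECASE)

# Assumption: table cells are separated by a tab or by 2+ spaces.
CELL_SPLIT_RE = re.compile(r"\s{2,}|\t")

MIN_TABLE_CELLS = 3  # hypothetical threshold, tune per corpus


def classify_line(line):
    """Label one extracted line as 'blank', 'caption', 'table' or 'text'."""
    stripped = line.strip()
    if not stripped:
        return "blank"
    if CAPTION_RE.match(line):
        return "caption"
    cells = [c for c in CELL_SPLIT_RE.split(stripped) if c]
    if len(cells) >= MIN_TABLE_CELLS:
        return "table"
    return "text"


def split_body(lines):
    """Separate running text from lines that look like captions or table rows."""
    body, other = [], []
    for line in lines:
        label = classify_line(line)
        if label == "text":
            body.append(line)
        elif label != "blank":
            other.append((label, line))
    return body, other


sample = [
    "Corpus size has grown steadily.",
    "Table 1: Corpus sizes",
    "BNC    100M    1994",
    "Figure 2. Growth curve",
    "",
]
body, other = split_body(sample)
print(body)
print(other)
```

Line-level rules like these will misfire on some documents (for example, justified text with wide gaps), so in practice they would need tuning against the PDFs at hand.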
Scott Piao
-------------------
Computing Department
Lancaster University
Lancaster LA1 4WA
UK
-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On Behalf Of Ken Litkowski
Sent: 28 March 2006 16:35
To: corpora at hd.uib.no
Subject: [Corpora-List] PDF Conversion
Is anyone aware of free software that will process PDF documents into text streams? There is a PDF2HTML (with an XML option) that will create page-centric versions, but this does not really distinguish text from format. I want to ignore (or be able to treat separately) such things as headers, footnotes, tables, figures, and equations. (Note that even Google retains the page-centric view.)
Thanks,
Ken
--
Ken Litkowski TEL.: 301-482-0237
CL Research EMAIL: ken at clres.com
9208 Gue Road
Damascus, MD 20872-1025 USA Home Page: http://www.clres.com