[Corpora-List] PDF Conversion

Wed Mar 29 21:12:13 UTC 2006

Hi Ken, all,

Just to add to Scott's note about Multivalent, the website is: 

http://multivalent.sourceforge.net/

We compared it to Adobe Acrobat v6 and v7 and found that, for extracting
text and preserving text flow in two-column layouts (such as those in the
ACL Anthology), Multivalent is much more accurate. Obviously this applies
only to text-based PDFs. For image-based PDFs (I am not sure what
percentage of the ACL Anthology these make up), OCR seems to be the only
way to go, with, say, OmniPage Pro v14. Even with Multivalent and
text-based PDFs, you still need to add post-processing steps to deal with
ligatures (ffi, fi, fl, ff, ffl) and extended ASCII codes (>127) before
you can feed the output into Unix/Linux-style tools. This is important for
building word lists and finding new lexical items!
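
As a rough illustration of the ligature and >127 clean-up described above,
here is a minimal Python sketch. It is not the post-processor we used. It
assumes the extracted text has already been decoded to Unicode and that the
ligatures arrive as the Unicode presentation-form characters U+FB00 to
U+FB04; the function names and the "?" placeholder are hypothetical choices
made for the example.

```python
import unicodedata

# Unicode presentation-form ligatures mapped to their plain-letter spellings.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def expand_ligatures(text: str) -> str:
    """Replace each ligature character with the letters it stands for."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text


def to_ascii(text: str, placeholder: str = "?") -> str:
    """Drop accents where possible; mark any remaining code > 127."""
    decomposed = unicodedata.normalize("NFKD", text)
    out = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # skip the combining accent, keep the base letter
        out.append(ch if ord(ch) < 128 else placeholder)
    return "".join(out)


def clean_extracted_text(text: str) -> str:
    """Expand ligatures first, then reduce the text to 7-bit ASCII."""
    return to_ascii(expand_ligatures(text))
```

Note that reducing everything to ASCII is lossy (accented forms collapse
onto their unaccented spellings), so for word lists in languages where
accents matter it may be better to expand the ligatures only and keep the
rest of the text as UTF-8.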

Regards,
Paul.

Dr. Paul Rayson
Director of UCREL
Computing Department, Infolab21, South Drive, Lancaster University,
Lancaster, LA1 4WA, UK.
Web: http://www.comp.lancs.ac.uk/computing/users/paul/
Tel: +44 1524 510357 Fax: +44 1524 510492

-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
Behalf Of Piao, Songlin
Sent: 29 March 2006 15:17
To: Ken Litkowski; corpora at hd.uib.no
Subject: RE: [Corpora-List] PDF Conversion

Hi,

We tested the Multivalent tool for extracting text from PDF files and
found that it works pretty well.

To identify figures, tables, etc., you need to add a post-processor based
on heuristic algorithms. We tried some algorithms for tables and figures
and got reasonably good results.
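
To give a flavour of what such a heuristic post-processor might look like,
here is a minimal Python sketch. It is not the algorithm we tried. It
assumes the extracted text keeps table columns apart with runs of two or
more spaces, which may not hold for every extractor; the function names and
thresholds are hypothetical.

```python
import re

# Two or more whitespace characters are treated as a column gap (assumption).
CELL_GAP = re.compile(r"\s{2,}")
# Tokens such as 12, -3.5, 1,234 or 85% count as numeric.
NUMERIC = re.compile(r"^[+-]?\d+(?:[.,]\d+)*%?$")


def looks_like_table_row(line, min_cells=3, min_numeric_ratio=0.5):
    """Guess whether a single line of extracted text is a table row."""
    stripped = line.strip()
    if not stripped:
        return False
    # Heuristic 1: several cells separated by wide gaps.
    if len(CELL_GAP.split(stripped)) >= min_cells:
        return True
    # Heuristic 2: a line made up mostly of numbers.
    tokens = stripped.split()
    numeric = sum(1 for token in tokens if NUMERIC.match(token))
    return len(tokens) >= min_cells and numeric / len(tokens) >= min_numeric_ratio


def split_tables(lines, min_run=2):
    """Split lines into (body, table) lists.

    Only runs of at least min_run consecutive table-like lines are moved
    to the table list; shorter runs stay in the body text.
    """
    body, tables, run = [], [], []
    for line in lines:
        if looks_like_table_row(line):
            run.append(line)
            continue
        (tables if len(run) >= min_run else body).extend(run)
        run = []
        body.append(line)
    (tables if len(run) >= min_run else body).extend(run)
    return body, tables
```

Requiring a run of consecutive table-like lines is a design choice to avoid
throwing away an isolated body line that merely happens to contain a few
numbers.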

Scott Piao
-------------------
Computing Department
Lancaster University
Lancaster LA1 4WA
UK

-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
Behalf Of Ken Litkowski
Sent: 28 March 2006 16:35
To: corpora at hd.uib.no
Subject: [Corpora-List] PDF Conversion

Is anyone aware of free software that will process PDF documents into
text streams?  There is a PDF2HTML (with an XML option) that will create
page-centric versions, but this does not really distinguish text from
format.  I want to ignore (or be able to treat separately) such things
as headers, footnotes, tables, figures, and equations.  (Note that even
Google retains the page-centric view.)

Thanks,
	Ken
-- 
Ken Litkowski                     TEL.: 301-482-0237
CL Research                       EMAIL: ken at clres.com
9208 Gue Road
Damascus, MD 20872-1025 USA       Home Page: http://www.clres.com