[Corpora-List] PDF Conversion
Brett Powley
bpowley at ics.mq.edu.au
Wed Mar 29 03:11:33 UTC 2006
Hi Ken,
The work I have been doing (with the ACL anthology) involves doing
precisely this. I spent some time evaluating tools to do it, including:
Adobe Reader (using Save as Text...)
Multivalent (Java, open source)
PDFBox (Java, open source)
XPDF (open source)
Etymon Pjx (open source)
PDFTextStream (commercial)
JPedal (commercial)
Argus (commercial)
3-heights PDF extract (pdf-tools) (commercial)
(I also looked to see whether Mac OS X provided any API for text
extraction since it has built-in PDF support and it indexes PDF
documents, but if there is an API it's not a public one yet.)
The one that gave the best performance was PDFBox (open source,
Java), but among the ones that performed well, there really wasn't
much in it.
There are two major issues in PDF extraction:
(1) Page layout -- footnotes, columns, etc. PDF is (was) designed to
provide an accurate on-screen or printed rendering of a document
(it's essentially a special version of PostScript), so getting the
text back out wasn't a concern for the original designers, at least.
This means in theory that the text can appear in the file in any
arbitrary order (the order in which it's drawn), though in practice
it tends to be in a somewhat sensible order -- the text tends to be
in order, and columns tend to be OK too. Footnotes, headers, and
footers, however, are a more difficult problem.
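To illustrate the ordering problem in (1), here is a minimal Python sketch that puts text fragments back into reading order on a multi-column page. It is not how any of the tools above work: the Fragment type, the reading_order function, and the equal-width column split are all hypothetical, and it assumes the fragment positions have already been extracted.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text with its drawing position (hypothetical type)."""
    x: float   # left edge, in points
    y: float   # baseline, in points; the PDF origin is at the bottom left
    text: str


def reading_order(fragments, page_width, columns=2):
    """Sort fragments column by column, top to bottom, left to right.

    Assumes equal-width columns, which is a simplification.
    """
    col_width = page_width / columns

    def key(f):
        col = min(int(f.x // col_width), columns - 1)
        # Negate y so that larger y (higher on the page) sorts first.
        return (col, -f.y, f.x)

    return [f.text for f in sorted(fragments, key=key)]


# Fragments listed in an arbitrary drawing order.
page = [
    Fragment(320, 700, "right column, line 1"),
    Fragment(50, 680, "left column, line 2"),
    Fragment(50, 700, "left column, line 1"),
    Fragment(320, 680, "right column, line 2"),
]
ordered = reading_order(page, page_width=612)
```

A positional sort like this does nothing to separate footnotes, headers, and footers from body text; they just end up at the top or bottom of whichever column they fall in.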
(2) Font encoding -- when a PDF document uses an embedded font
subset, the mapping between the character codes used for characters
and what characters they represent is generally unknown. The
document essentially looks like "draw character X here" where X
points to the glyph which should be drawn, but is otherwise arbitrary.
All of the tools above failed in the same way on documents with
embedded font subsets. For the ACL anthology, this seemed to affect
about 40% of the documents.
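The encoding problem in (2) can be shown with a toy model in Python. The code numbering used here (1, 2, 3, ... in order of first use) is an assumption made for illustration; real subsetting tools may assign codes differently, and the function names are hypothetical. The point is only that the codes in the file mean nothing without a table mapping them back to characters.

```python
def subset_encode(text):
    """Encode text with arbitrary per-document character codes.

    Returns the code sequence and the code-to-character table that an
    extractor would need in order to recover the text.
    """
    code_for = {}
    codes = []
    for ch in text:
        if ch not in code_for:
            code_for[ch] = len(code_for) + 1
        codes.append(code_for[ch])
    to_unicode = {code: ch for ch, code in code_for.items()}
    return codes, to_unicode


def naive_extract(codes):
    """Treat the codes as ordinary character codes (no mapping known)."""
    return "".join(chr(c) for c in codes)


def mapped_extract(codes, to_unicode):
    """Recover the text when the code-to-character table is available."""
    return "".join(to_unicode[c] for c in codes)


codes, to_unicode = subset_encode("corpora")
garbled = naive_extract(codes)
recovered = mapped_extract(codes, to_unicode)
```

Here codes is [1, 2, 3, 4, 2, 3, 5]: the renderer can still draw the right glyphs, but an extractor without the table gets a string of control characters rather than "corpora".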
One solution to this problem (albeit not a very elegant one) is to
render the PDF documents that have font encoding problems as bitmaps,
and then run OCR on them.
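Since OCR is slow, it may be worth applying it only to the documents whose extracted text is unusable. The following Python sketch is one possible way to make that decision; the character test and the 0.8 threshold are assumptions that would need tuning on real data, and the function names are hypothetical.

```python
def printable_ratio(text):
    """Fraction of characters that look like ordinary running text."""
    if not text:
        return 0.0
    ok = sum(
        1 for ch in text
        if ch.isalnum() or ch.isspace() or ch in ".,;:!?'\"()-"
    )
    return ok / len(text)


def needs_ocr(text, threshold=0.8):
    """Flag extracted text that is empty or mostly non-text characters."""
    return printable_ratio(text) < threshold


clean = "The work involves the ACL anthology."
garbled = "\x01\x02\x03\x04\x02\x03\x05"
decisions = (needs_ocr(clean), needs_ocr(garbled), needs_ocr(""))
```

This only catches the case where the arbitrary codes land outside normal text. If they happen to decode to plausible-looking letters, a dictionary or language-model check would be needed instead.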
Hope this helps,
Brett
On 29/03/2006, at 2:35 AM, Ken Litkowski wrote:
> Is anyone aware of free software that will process PDF documents
> into text streams? There is a PDF2HTML (with an XML option) that
> will create page-centric versions, but this does not really
> distinguish text from format. I want to ignore (or be able to
> treat separately) such things as headers, footnotes, tables,
> figures, and equations. (Note that even Google retains the page-
> centric view.)
>
> Thanks,
> Ken
> --
> Ken Litkowski TEL.: 301-482-0237
> CL Research EMAIL: ken at clres.com
> 9208 Gue Road
> Damascus, MD 20872-1025 USA Home Page: http://www.clres.com
--------------------------------------------------------------
Brett Powley -- PhD Candidate
Centre for Language Technology, Macquarie University, Australia
p: +61-402-013050 f: +61-2-90120813 e: bpowley at ics.mq.edu.au
w: http://www.ics.mq.edu.au/~bpowley
faciendi plures libros nullus est finis
frequensque meditatio carnis adflictio est
--------------------------------------------------------------