[Corpora-List] PDF Conversion

Mike Maxwell maxwell at ldc.upenn.edu
Wed Mar 29 01:42:07 UTC 2006


Tom Emerson wrote:
> Ken Litkowski writes:
>> Is anyone aware of free software that will process PDF documents into 
>> text streams?  
> ...
> Given that PDF is a page-centric format, so you are unlikely to find
> something that does what you are looking for: headers, footnotes,
> tables, etc. are not going to be flagged from the surrounding content
> in any special way.

I suspect (but don't know) that the problems Tom describes here (which I
can confirm from experience) will affect _any_ PDF-to-text converter.  In
addition to his list of problems (and those others have mentioned, e.g.
the fact that some PDFs are basically bitmaps), here are some problems
we've encountered:

1) Multi-column text may come out as
	<line1 from column1>   <line1 from column2>
	<line2 from column1>   <line2 from column2>
    etc., rather than what you want:
	<line1 from column1>
	<line2 from column1>
	...
	<line1 from column2>
	<line2 from column2>

2) Character encoding can be a mess.  In some cases it's sheer 
gibberish; in other cases, you get something that is nearly "correct", 
but with exceptions.  I saw Tigrinya (Ethiopic language) text that came 
out of PDFs as Unicode characters in the Ethiopic range, _except_ for 
about five alphabetic characters, which came out in the ASCII range.  We 
were able to figure out what the particular characters were supposed to 
be--something like glottal stop + a vowel, IIRC--and map them correctly.
But I've always wondered why they did it that way.  In some of the
gibberish cases (Bengali, IIRC), I suspect it was just a proprietary 
encoding.  But I've seen English text extract as gibberish, so it almost 
looks like some kind of encryption for the purpose of preventing you 
from extracting the text.

3) If the original is something like a newspaper or newsletter, a story may
continue on a later page, with other stories in between, leaving you to 
try to piece together a single story that has interruptions of text from 
other stories (and as Tom writes, from headers and footers).  In one 
case like this, we were resigned to manually piecing the stories back 
together (it wasn't a language we knew, but you could sort of figure it 
out), when someone (I believe it was Julie Medero, at the LDC) 
discovered that the PDF files in question were built on the fly from 
plain text source files.  We happily took the text files instead!
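
Problems 1 and 2 can sometimes be patched up after extraction.  What
follows is only a minimal Python sketch under stated assumptions, not a
general solution.  For problem 1 it assumes exactly two columns and that
the converter left a tab or a run of three or more spaces between them.
For problem 2 it assumes you have already worked out by hand which ASCII
characters stand in for which Ethiopic ones; the two entries in
STRAY_ASCII_MAP are made up for illustration and are not the mapping we
found.  All function and variable names are hypothetical.

```python
import re

# Assumption: the gutter between the two columns shows up as a tab
# or as a run of three or more spaces.
GUTTER = re.compile(r"\t+| {3,}")

def deinterleave_columns(lines):
    """Return all left-column lines followed by all right-column lines."""
    left, right = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        parts = GUTTER.split(line.rstrip(), maxsplit=1)
        left.append(parts[0].strip())
        if len(parts) == 2:
            right.append(parts[1].strip())
    return left + right

# Hypothetical mapping from stray ASCII characters to Ethiopic ones.
# The real mapping has to be worked out by inspecting the text.
STRAY_ASCII_MAP = str.maketrans({"A": "\u12A0", "U": "\u12A1"})

def is_ethiopic(ch):
    """True if ch lies in the basic Ethiopic block (U+1200..U+137F)."""
    return "\u1200" <= ch <= "\u137F"

def repair_token(token):
    """Remap stray ASCII only inside tokens that also contain Ethiopic."""
    if any(is_ethiopic(ch) for ch in token):
        return token.translate(STRAY_ASCII_MAP)
    return token

def repair_text(text):
    """Apply repair_token to each space-separated token of text."""
    return " ".join(repair_token(t) for t in text.split(" "))
```

Restricting the remapping to tokens that already contain Ethiopic
characters keeps genuine ASCII material (numbers, English words) intact,
at the cost of missing any word that consists solely of stray
characters.  The column splitter will likewise go wrong on pages where
the gutter is narrower than assumed or where a line has wide spacing
inside a single column.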

Assuming that all these kinds of problems are inherent in the way the
text is stored in (non-bitmap) PDFs, a converter would have to be very
smart indeed to get well-structured text out reliably.

But if all you want is to mine new terms, why worry about the formatting 
in the first place?

    Mike Maxwell


