[Corpora-List] PDF Conversion
Mike Maxwell
maxwell at ldc.upenn.edu
Wed Mar 29 01:42:07 UTC 2006
Tom Emerson wrote:
> Ken Litkowski writes:
>> Is anyone aware of free software that will process PDF documents into
>> text streams?
> ...
> Given that PDF is a page-centric format, you are unlikely to find
> something that does what you are looking for: headers, footnotes,
> tables, etc. are not going to be flagged from the surrounding content
> in any special way.
I suspect (but don't know) that the problems Tom describes here (which I
can second from experience) are going to affect _any_ PDF-to-text
converter. In addition to his list (and the problems others have
mentioned, e.g. the fact that some PDFs are basically bitmaps), here are
some problems we've encountered:
1) Multi-column text may come out as
<line1 from column1> <line1 from column2>
<line2 from column1> <line2 from column2>
etc., rather than what you want:
<line1 from column1>
<line2 from column1>
...
<line1 from column2>
<line2 from column2>
2) Character encoding can be a mess. In some cases it's sheer
gibberish; in other cases, you get something that is nearly "correct",
but with exceptions. I saw Tigrinya (Ethiopic language) text that came
out of PDFs as Unicode characters in the Ethiopic range, _except_ for
about five alphabetic characters, which came out in the ASCII range. We
were able to figure out what the particular characters were supposed to
be--something like glottal stop + a vowel, IIRC--and map them correctly.
But I've always wondered why they did it that way. In some of the
gibberish cases (Bengali, IIRC), I suspect it was just a proprietary
encoding. But I've seen English text extract as gibberish, so it almost
looks like some kind of encryption for the purpose of preventing you
from extracting the text.
3) If the original is something like a newspaper or newsletter, a story may
continue on a later page, with other stories in between, leaving you to
try to piece together a single story that has interruptions of text from
other stories (and as Tom writes, from headers and footers). In one
case like this, we were resigned to manually piecing the stories back
together (it wasn't a language we knew, but you could sort of figure it
out), when someone (I believe it was Julie Medero, at the LDC)
discovered that the PDF files in question were built on the fly from
plain text source files. We happily took the text files instead!
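For what it's worth, here is a minimal Python sketch of the kind of
post-processing that problems 1 and 2 call for. It is not what we
actually ran, and it assumes things the discussion above does not
guarantee: that your extractor gives you each text fragment with page
coordinates (the Fragment class is hypothetical), that the page has
exactly two columns split at a known x position, and that you have
already worked out by hand which stray ASCII characters stand for which
intended characters. The mapping in the usage note is purely
illustrative, not the Tigrinya mapping mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical record for one extracted line of text."""
    x: float   # left edge of the fragment on the page
    y: float   # vertical position; assumed to grow down the page
    text: str


def reorder_columns(fragments, column_boundary):
    """Return the left column top to bottom, then the right column.

    Assumes a simple two-column page split at column_boundary.
    Headers, footers and full-width lines are not handled.
    """
    left = [f for f in fragments if f.x < column_boundary]
    right = [f for f in fragments if f.x >= column_boundary]
    ordered = sorted(left, key=lambda f: f.y) + sorted(right, key=lambda f: f.y)
    return [f.text for f in ordered]


def remap_stray_characters(text, mapping):
    """Replace each single character in mapping with its intended form.

    mapping is a hand-built dict such as {"A": "\u12A0"}. Every
    occurrence is replaced, so embedded text that legitimately uses
    those ASCII characters would be damaged too.
    """
    return text.translate(str.maketrans(mapping))
```

You would call reorder_columns(fragments, 200) on the fragments of one
page, with 200 standing in for wherever the gutter falls, and
remap_stray_characters(text, {"A": "\u12A0"}) on the result. Neither
function does anything for problem 3: nothing on the page says which
story a block of text belongs to.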
Assuming that all these kinds of problems are inherent in the way the
text is stored in (non-bitmap) PDFs, a converter would have to be very
smart indeed to get well-structured text out reliably.
But if all you want is to mine new terms, why worry about the formatting
in the first place?
Mike Maxwell