[Corpora-List] pdfs/ OCR question

Brett Powley bpowley at ics.mq.edu.au
Tue Dec 12 03:45:43 UTC 2006


There are several issues with extracting text from PDF files:

- Scanned and OCRed documents, as has been mentioned, often have the  
scanned images of the original plus a text 'layer' to be used for  
copying-and-pasting.  Not all documents have this text layer, however.

- In some senses, it can be said that PDF 'preserves the original text  
strings'.  However, PDF wasn't designed for recovery of the original  
text; it was designed for faithful rendering on screen or on a  
printer. Frequently, spaces are missing from text in the PDF file --  
for rendering, this doesn't matter, since the characters simply need  
to be drawn in the correct place. However, for text extraction, the  
presence of spaces often has to be inferred from the position of  
surrounding characters.  Line breaks are never present, and again  
must be inferred from text placement. The sequence of text in the PDF  
document may not be the same sequence as in the original file, since  
sequence is irrelevant to rendering.  And so on...

- Some PDF files use font subsets with custom encodings -- they have  
a table at the beginning of the file with codes and the glyphs to  
render for each code; however, these codes aren't in ASCII or UTF-8  
or anything recognisable.  When you extract text from such a file,  
you generally get junk.
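
To make the point about inferring spaces and line breaks concrete, here is a minimal sketch in Python. It assumes you have already pulled out each glyph with its position and width; the Glyph record, the rebuild_text function, and the two thresholds are all hypothetical names and values chosen for illustration, not part of PDFBox, Multivalent, or any other tool. It sorts glyphs into reading order, starts a new line when the baseline changes, and inserts a space when the horizontal gap to the previous glyph is large relative to that glyph's width.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline; assumed to increase up the page
    width: float  # advance width of the glyph


def rebuild_text(glyphs, space_ratio=0.3, line_tol=2.0):
    """Guess spaces and line breaks purely from glyph placement.

    space_ratio and line_tol are illustrative thresholds, not tuned values.
    """
    # Group glyphs into lines: similar baselines belong to the same line.
    lines = []
    for g in sorted(glyphs, key=lambda g: -g.y):  # top of page first
        if lines and abs(lines[-1][0].y - g.y) <= line_tol:
            lines[-1].append(g)
        else:
            lines.append([g])

    out = []
    for line in lines:
        # Order within the line is left to right, whatever the file order was.
        line.sort(key=lambda g: g.x)
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            # A gap noticeably wider than normal letter spacing is taken
            # to be a word boundary.
            if gap > space_ratio * prev.width:
                text += " "
            text += cur.char
        out.append(text)
    return "\n".join(out)


# Glyphs deliberately listed out of reading order, as a PDF might store them.
sample = [
    Glyph("f", 5, 88, 5), Glyph("c", 13, 100, 5), Glyph("a", 0, 100, 5),
    Glyph("e", 0, 88, 5), Glyph("d", 18, 100, 5), Glyph("b", 5, 100, 5),
]
result = rebuild_text(sample)
```

A sketch like this breaks down on multi-column layouts, justified text, and rotated pages, which is part of why extraction results vary so much between documents.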

There are a few tools around for extracting text from PDF files --  
PDFBox and Multivalent are two open source tools that I've used that  
perform pretty well.

Good luck!

Brett Powley




On 12/12/2006, at 2:31 PM, John F. Sowa wrote:

> That depends on how the PDF was created:
>
> > interesting to know that pdf files store text info separately!
>
> Some PDF files are generated by scanning each page of a book or
> article into an image format (GIF or TIFF, for example).  In such
> a PDF file, there are no character strings internally, and some
> kind of OCR is necessary to convert the image into a character
> string.  The OCR process might convert an image for "the"
> into the character string "die".
>
> But if the PDF file had been generated from a text string in
> any textual form, such as HTML, LaTeX, TXT, ODT, or DOC formats,
> the internal PDF file preserves the original text strings.  If
> you copy and paste text from a PDF of that kind into an editor
> for some other kind of text, such as OpenOffice or MS Word, you
> will get a copy of the original character string, but some or
> all of the formatting info may be lost.  That process would
> never convert "the" into "die".
>
> There are some caveats, however.  Some PDF files may have
> special characters for ligatures, such as fi, fl, ff, etc.
> Even though the ligatures are represented in character strings,
> a copy & paste from such files to another editor may convert
> the ligature to an unrecognized character.  (Some OCR systems
> also have difficulty with ligatures because the letters "f"
> and "i" or "l" are too close together for easy recognition.)
>
> John Sowa
>



--------------------------------------------------------------
Brett Powley -- PhD Candidate
Centre for Language Technology, Macquarie University,  Australia
w: http://www.ics.mq.edu.au/~bpowley
faciendi plures libros nullus est finis
frequensque meditatio carnis adflictio est
--------------------------------------------------------------


