[Corpora-List] pdfs/ OCR question

Mon Dec 11 19:23:38 UTC 2006

Recent versions of Acrobat have internal OCR functionality. Apparently 
each page is analyzed separately and systematically processed in several 
steps. Character recognition occurs before words are built. You'll get 
different errors depending on which language was set as the OCR language 
(several can be selected). Since 'die' is a common German word, the 
error you found suggests the file may have been processed as German. You 
can rerun OCR on the page at any time (provided you have Acrobat 
Standard or higher [1]). If you export 
the page images to TIFF format (lossless), you can run them through any 
OCR program, including the one provided as part of Microsoft Office 
(Microsoft Document Imaging).
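
As a rough sketch of the export-and-re-OCR route, assuming the open-source
command-line tools pdftoppm (from Poppler) and tesseract are installed
(neither is mentioned in this thread), the two steps could be scripted
roughly as follows. The helper names, file names and the 300 dpi
resolution are illustrative assumptions only.

```python
import shutil
import subprocess
from pathlib import Path


def rasterize_command(pdf_path, out_prefix, dpi=300):
    """Build the command that bursts a PDF into one TIFF image per page."""
    # pdftoppm writes files named <out_prefix>-<page number>.tif
    return ["pdftoppm", "-tiff", "-r", str(dpi), str(pdf_path), str(out_prefix)]


def ocr_command(image_path, language="eng"):
    """Build the command that OCRs one page image for a given language."""
    # tesseract appends ".txt" to the output base name itself
    out_base = Path(image_path).with_suffix("")
    return ["tesseract", str(image_path), str(out_base), "-l", language]


def reocr(pdf_path, work_dir, language="eng"):
    """Run both steps and return the paths of the text files produced."""
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_command(pdf_path, work_dir / "page"), check=True)
    texts = []
    for image in sorted(work_dir.glob("page*.tif")):
        subprocess.run(ocr_command(image, language), check=True)
        texts.append(image.with_suffix(".txt"))
    return texts
```

Calling reocr("scan.pdf", "pages", language="eng") and then again with
language="deu" would be one way to see how much the OCR language setting
changes the errors; "scan.pdf" is a placeholder name.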

I am currently unaware of any software that will clean up such errors, 
but Office 2007 imaging software may have some of that functionality 
built in, because Word 2007 has probabilistic error detection. This is 
only a suspicion and would have to be verified. Maybe 
the Microsoft people on this list would be able to help.
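
A very crude clean-up pass can at least be sketched by hand. The snippet
below is only an illustration, not an existing tool: it flags tokens from
a small hand-made list of OCR confusions (such as 'die' or 'tlie' for
'the') so that a person can review them. The confusion list and the
function name are made up for this example.

```python
import re

# Hypothetical confusion list: OCR output -> likely intended English word.
CONFUSIONS = {"die": "the", "tlie": "the", "arid": "and"}

TOKEN = re.compile(r"[A-Za-z]+")


def flag_suspects(text, confusions=CONFUSIONS):
    """Return (position, token, suggestion) for each token in the list."""
    hits = []
    for match in TOKEN.finditer(text):
        suggestion = confusions.get(match.group().lower())
        if suggestion is not None:
            hits.append((match.start(), match.group(), suggestion))
    return hits
```

Since 'die' and 'arid' are also legitimate English words, the sketch only
reports candidates and never replaces anything automatically.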

Best,

Klaus

[1] http://www.adobe.com/products/acrobat/matrix.html

---
Klaus Guenther
Graduate Assistant
Chair of English Linguistics
University of Bamberg, Germany

Hunter, Duncan wrote:
> Thanks for this Alexandre.
>  
> Interesting to know that PDF files store text info separately! 
> That makes sense, and it also means that the errors have already 
> occurred (at the stage of PDF creation).
>  
> It looks like the job of fixing the textual errors is a big one. I 
> think it may simply be a question of accepting the limitations of the 
> corpus we've generated by 'ripping' text from the imperfect pdf files?
>  
> Many thanks,
>  
> Duncan
>
> ------------------------------------------------------------------------
> *From:* owner-corpora at lists.uib.no on behalf of Alexandre Rafalovitch
> *Sent:* Mon 11/12/2006 16:21
> *To:* corpora at uib.no
> *Subject:* Re: [Corpora-List] pdfs/ OCR question
>
> I would guess that the OCR had been done by the software that
> generated the PDF. You might be able to check what it is by looking at
> the PDF document's properties. The text is stored on a separate layer
> from the image, and the reader just does region matching for selection
> purposes.
>
> If you need to have this fixed, you probably will need to burst out
> the PDF into its page images and have those re-OCRed.
>
> Software you might find useful includes PDFBox (http://www.pdfbox.org/)
> and Gamera (http://ldp.library.jhu.edu/projects/gamera/)
>
> You can also look at the Distributed Proofreaders to see if there is
> anything to be learned from their experience: http://www.pgdp.net/
>
> Regards,
>    Alex.
>
> On 12/11/06, Hunter, Duncan <D.I.Hunter at warwick.ac.uk> wrote:
> > Quick question about pdfs/ OCR:
> >
> > Some text is copied from a pdf file and pasted into a text or Word file.
> > It contains errors - say, for example, 'the' has become 'die' (you notice
> > that in the original pdf the 't' and 'h' are quite close together). At what
> > stage has this misrecognition/miscopying occurred?
> > Where does the OCR take place? The OCR functionality is, presumably, part
> > of the .pdf reader software itself?
> >
> > Can anything be done to deal with the problem?
> >
> > Duncan Hunter
> >
> >
>