[Corpora-List] pdfs/ OCR question

Hunter, Duncan D.I.Hunter at warwick.ac.uk
Mon Dec 11 14:10:23 UTC 2006


Quick question about PDFs and OCR:
 
Some text is copied from a PDF file and pasted into a text or Word file. It contains errors: say, for example, 'the' has become 'die' (you notice that in the original PDF the 't' and 'h' are quite close together). At what stage has this misrecognition or miscopying occurred?
Where does the OCR take place? The OCR functionality is, presumably, part of the PDF reader software itself?
 
Can anything be done to deal with the problem? 
 
Duncan Hunter
 
 

