<HTML dir=ltr><HEAD><TITLE>RE: [Corpora-List] word frequencies on the web</TITLE>

<META http-equiv=Content-Type content="text/html; charset=unicode">

<META content="MSHTML 6.00.2900.2963" name=GENERATOR></HEAD>

<BODY>

<DIV id=idOWAReplyText80247 dir=ltr>

<DIV dir=ltr><FONT size=2>Quick question about PDFs/OCR:</FONT></DIV>

<DIV dir=ltr><FONT size=2></FONT> </DIV>

<DIV dir=ltr><FONT size=2>Some text is copied from a PDF file and pasted into a text or Word file. It contains errors: for example, 'the' has become 'die' (you notice that in the original PDF the 't' and 'h' are quite close together). At what stage has this misrecognition/miscopying occurred? </FONT></DIV>

<DIV dir=ltr><FONT size=2>Where does the OCR take place? The OCR functionality is, presumably, part of the PDF reader software itself?</FONT></DIV>

<DIV dir=ltr><FONT size=2></FONT> </DIV>

<DIV dir=ltr><FONT size=2>Can anything be done to deal with the problem? </FONT></DIV>

<DIV dir=ltr><FONT size=2></FONT> </DIV>

<DIV dir=ltr><FONT size=2>Duncan Hunter</FONT></DIV>

<DIV dir=ltr><FONT size=2></FONT> </DIV>

<DIV dir=ltr><FONT size=2></FONT> </DIV></DIV></BODY></HTML>