<HTML dir=ltr><HEAD>

<META http-equiv=Content-Type content="text/html; charset=unicode">

</HEAD>

<BODY>

<DIV><FONT face=Arial color=#000000 size=2>Hi list members,</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>I have been given access to a large amount of data that has been converted by OCR into plain-text (.txt) files. The data is extremely valuable for a number of reasons, and I would like to carry out, amongst other things, a keyword analysis. However, test runs with corpus investigation tools show that the reliability of the corpus is affected by OCR errors (misread characters, words split across end-of-line boundaries, etc.).</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>How can valuable but 'imperfect' corpus data be used effectively? Does anyone have tips on how anomalous (but generally explicable) results can be described and accounted for in a principled, consistent manner?</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>Duncan Hunter</FONT></DIV></BODY></HTML>