Okan Kolak did some work on OCR post-processing, although I don't think he's pursuing it currently.<br><br>Try:<br><br>Okan Kolak and Philip Resnik, "OCR Post-Processing for Low
Density Languages", HLT/EMNLP 2005, Vancouver, October 2005.<br><br><div><span class="gmail_quote">On 11/16/06, <b class="gmail_sendername">Yannick Versley</b> &lt;<a href="mailto:versley@sfs.uni-tuebingen.de">
versley@sfs.uni-tuebingen.de</a>&gt; wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Hi,<br><br>> I have been given access to a large amount of data, which has been OCR'd
<br>> into a digital (.txt file) format. The data is extremely valuable for a<br>> number of reasons and I would like to carry out, amongst other things, a<br>> Keyword analysis. However, test-runs with corpus investigation tools show
<br>> that there are a few problems with the reliability of the corpus due to OCR<br>> errors (mis-copying and fragmentation of words over end-of-line boundaries,<br>> etc.).<br>I think it may be worth trying to (semi-)automatically correct the most
<br>blatant of these errors, for example to merge word fragments that are<br>split over the end of the line, or (assuming that the errors are rare in<br>proportion to the rest) to correct rare words that do not occur in a
<br>dictionary or another known-good word list and are not capitalized (i.e., are unlikely to be<br>named entities) to the nearest word that may be the correct spelling.<br>Of course, there is much guesswork involved here, but if you aim for a keyword
<br>analysis, you have a better chance if you correct errors using a moderate<br>amount of linguistic knowledge than if you just try to live with the noisy<br>data.<br><br>Best,<br>Yannick Versley<br><br>--<br>Yannick Versley
<br>Seminar für Sprachwissenschaft, Abt. Computerlinguistik<br>Wilhelmstr. 19, 72074 Tübingen<br>Tel.: (07071) 29 77352<br><br></blockquote></div><br><br clear="all"><br>-- <br>Ed Kenschaft<br>Ph.D. student, Computational Linguistics, University of Maryland
<br><a href="mailto:ekenschaft@gmail.com">ekenschaft@gmail.com</a><br><a href="http://www.umiacs.umd.edu/users/kensch/">www.umiacs.umd.edu/users/kensch/</a>
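<br><br>P.S. The two corrections Yannick suggests in the quoted message (rejoining word fragments split over a line end, and replacing uncapitalized words missing from a known-good word list with the nearest listed word) could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes split words are marked with a trailing hyphen, it uses the similarity ratio from Python's standard difflib module as a stand-in for "nearest word", and the function names and the 0.8 cutoff are hypothetical choices of mine, not anything from the thread or from the Kolak and Resnik paper.<br><br>

```python
import difflib
import re


def merge_line_breaks(text):
    """Rejoin words split over a line end, e.g. 'con-\\ntains' -> 'contains'.

    Assumption: fragments are marked with a trailing hyphen. Genuine
    hyphenated compounds that happen to break at the hyphen are merged too.
    """
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def correct_text(text, vocab, cutoff=0.8):
    """Replace unknown lowercase words with the closest word in vocab.

    Capitalized tokens are left alone on the guess that they are named
    entities; tokens with no sufficiently close match are also left alone.
    """
    words = sorted(vocab)  # sorted so that ties are resolved reproducibly
    known = set(words)

    def fix(match):
        token = match.group(0)
        if token[0].isupper() or token in known:
            return token
        close = difflib.get_close_matches(token, words, n=1, cutoff=cutoff)
        return close[0] if close else token

    # Merge line-end fragments first, then check each alphabetic token.
    return re.sub(r"[A-Za-z]+", fix, merge_line_breaks(text))


if __name__ == "__main__":
    # Hypothetical word list and sample line, for illustration only.
    sample_vocab = {"the", "corpus", "contains", "valuable", "data"}
    print(correct_text("the corpvs con-\ntains valuable data", sample_vocab))
```

<br>A high cutoff keeps the guesswork conservative: a word is changed only when a listed word is very similar, and everything else is left as it was, so the result would still need spot-checking before a keyword analysis.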