[Corpora-List] 'imperfect' corpora

Ed Kenschaft ekenschaft at gmail.com
Thu Nov 16 13:50:27 UTC 2006


Okan Kolak did some work on OCR post-processing, although I don't think
he's pursuing that currently.

Try:

Okan Kolak and Philip Resnik, "OCR Post-Processing for Low Density
Languages", HLT/EMNLP 2005, Vancouver, October 2005.

On 11/16/06, Yannick Versley <versley at sfs.uni-tuebingen.de> wrote:
>
> Hi,
>
> > I have been given access to a large amount of data, which has been
> > OCR'd into a digital (.txt file) format. The data is extremely valuable
> > for a number of reasons and I would like to carry out, amongst other
> > things, a Keyword analysis. However, test-runs with corpus investigation
> > tools show that there are a few problems with the reliability of the
> > corpus due to OCR errors (mis-copying and fragmentation of words over
> > end-of-line boundaries, etc.).
>
> I think it may be worth trying to (semi-)automatically correct the most
> blatant of these errors. For example, you could merge word fragments that
> are split over the end of a line. Or, assuming that errors are rare
> relative to the rest of the text, you could replace rare words that do not
> occur in a dictionary or other known-good word list, and that are not
> capitalized (i.e. not likely to be named entities), with the nearest word
> that may be the correct spelling.
> Of course, there is much guesswork involved here, but if you aim for a
> keyword analysis, you have a better chance if you correct errors using a
> moderate amount of linguistic knowledge than if you just try to live with
> the noisy data.
>
> Best,
> Yannick Versley
>
> --
> Yannick Versley
> Seminar für Sprachwissenschaft, Abt. Computerlinguistik
> Wilhelmstr. 19, 72074 Tübingen
> Tel.: (07071) 29 77352
>
>
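
For what it's worth, here is a rough sketch of the two heuristics Yannick
describes, in Python. It is only an illustration under assumptions of my
own: that fragments split over a line end are marked with a hyphen, that
you have some known-good word list (known_words below is a placeholder),
and that string similarity from the standard difflib module is an
acceptable stand-in for "nearest word". The function names and the 0.8
cutoff are made up for the example and would need tuning on real data.

```python
import difflib
import re


def merge_line_breaks(text, known_words):
    """Rejoin words hyphenated across a line break ('cor-\\npus' -> 'corpus').

    Assumption: fragments are only merged when the joined form is in the
    word list, so genuine hyphens such as 'well-\\nknown' are left alone.
    """
    def join(match):
        joined = match.group(1) + match.group(2)
        return joined if joined.lower() in known_words else match.group(0)

    return re.sub(r"([^\W\d_]+)-[ \t]*\n[ \t]*([^\W\d_]+)", join, text)


def correct_token(token, known_words, cutoff=0.8):
    """Map an unknown all-lowercase word to its closest known word, if any."""
    # Skip punctuation/whitespace, known words, and anything capitalized
    # (capitalized tokens may be named entities).
    if not token.isalpha() or not token.islower() or token in known_words:
        return token
    matches = difflib.get_close_matches(token, known_words, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def clean(text, known_words):
    """Apply both heuristics: merge split words, then fix unknown words."""
    text = merge_line_breaks(text, known_words)
    # Split into runs of letters and runs of everything else, so that
    # spacing and punctuation are preserved when the pieces are rejoined.
    parts = re.findall(r"[^\W\d_]+|[\W\d_]+", text)
    return "".join(correct_token(part, known_words) for part in parts)
```

With known_words = {"corpus", "keyword", "analysis", "the", "of"}, calling
clean("The cor-\npus keyward analysis", known_words) gives
"The corpus keyword analysis". Anything this changes should probably be
logged and spot-checked by hand, since, as Yannick says, there is a lot
of guesswork involved.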


-- 
Ed Kenschaft
Ph.D. student, Computational Linguistics, University of Maryland
ekenschaft at gmail.com
www.umiacs.umd.edu/users/kensch/