[Corpora-List] 'imperfect' corpora

Thu Nov 16 16:59:42 UTC 2006

So far everyone seems to be suggesting that you correct the OCR errors
rather than work with the imperfect data as it is. But I think the latter
problem is more interesting, if only because you will most likely never
have an error-free corpus. It is a harder problem, closer to machine
learning, but that doesn't mean it can't be solved. Unfortunately, I have
no ideas about it.

I think one of Kolak and Resnik's papers deals with n-gram models at the
character level. This gives you the power of dictionary lookup and
correction, but with greater flexibility. It also lets you correct
misplaced word boundaries by treating spaces as ordinary characters --
which is one way to join word fragments that have been split, and to
split words that have been run together.
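
To make the idea concrete, here is a minimal sketch of a character
trigram model in which the space is just another symbol. It is a generic
illustration, not Kolak and Resnik's actual model; the class name, the
"^" and "$" padding symbols, the add-k smoothing and the toy training
text are all my own assumptions. In practice you would train on a large
amount of clean text from the same domain.

```python
import math
from collections import Counter


class CharNgramModel:
    """Character n-gram model with add-k smoothing.

    Spaces are treated as ordinary symbols, so the model can compare
    "fragmentation" with "frag mentation" directly.
    """

    def __init__(self, n=3, k=0.1):
        self.n = n                  # n-gram order
        self.k = k                  # smoothing constant (assumed value)
        self.ngrams = Counter()     # counts of full n-grams
        self.contexts = Counter()   # counts of (n-1)-character contexts
        self.vocab = set()          # symbols seen in training

    def _pad(self, text):
        # "^" and "$" mark start and end; assumed not to occur in the text.
        return "^" * (self.n - 1) + text + "$"

    def train(self, text):
        padded = self._pad(text)
        self.vocab.update(padded)
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            self.ngrams[gram] += 1
            self.contexts[gram[:-1]] += 1

    def logprob(self, text):
        """Smoothed log-probability of the whole string."""
        padded = self._pad(text)
        v = len(self.vocab) + 1     # one extra slot for unseen symbols
        total = 0.0
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            num = self.ngrams[gram] + self.k
            den = self.contexts[gram[:-1]] + self.k * v
            total += math.log(num / den)
        return total


def best_candidate(model, candidates):
    """Return the candidate string the model finds most probable."""
    return max(candidates, key=model.logprob)


model = CharNgramModel()
model.train("fragmentation of words over end of line boundaries is a "
            "common ocr error. the fragmentation of a word leaves two "
            "fragments.")
choice = best_candidate(model, ["frag mentation", "fragmentation"])
```

Given a split token pair, you would generate the joined and unjoined
readings as candidates and keep whichever the model prefers; the same
comparison works in the other direction for words that were run together.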

On 11/16/06, Ed Kenschaft <ekenschaft at gmail.com> wrote:
>
> Okan Kolak did some work in OCR postprocessing, although I don't think
> he's pursuing that currently.
>
> Try:
>
> Okan Kolak and Philip Resnik, "OCR Post-Processing for Low Density
> Languages", HLT/EMNLP 2005, Vancouver, October 2005.
>
> On 11/16/06, Yannick Versley <versley at sfs.uni-tuebingen.de> wrote:
> >
> > Hi,
> >
> > > I have been given access to a large amount of data, which has been
> > > OCR'd into a digital (.txt file) format. The data is extremely
> > > valuable for a number of reasons and I would like to carry out,
> > > amongst other things, a Keyword analysis. However, test-runs with
> > > corpus investigation tools show that there are a few problems with
> > > the reliability of the corpus due to OCR errors (mis-copying and
> > > fragmentation of words over end-of-line boundaries, etc.).
> >
> > I think it may be worth trying to (semi-)automatically correct the most
> > blatant of these errors, for example to merge word fragments that are
> > split over the end of the line, or (assuming that the errors are rare in
> > proportion to the rest) to correct rare words that do not occur in a
> > dictionary or another known-good word list and are not capitalized (i.e.
> > are probably not named entities) to the nearest word that may be the
> > correct spelling.
> > Of course, there is much guesswork involved here, but if you aim for a
> > keyword analysis, you have a better chance if you correct errors using a
> > moderate amount of linguistic knowledge than if you just try to live
> > with the noisy data.
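
The two heuristics Yannick describes just above can be sketched roughly
as follows. This is only an illustration under my own assumptions: the
function names, the 0.8 similarity cutoff, the max_count threshold for
"rare" and the use of Python's difflib for "nearest word" are all
hypothetical choices, and a real lexicon would of course be much larger.

```python
import difflib
import string


def _key(token):
    # Lookup form of a token: lower-cased, surrounding punctuation removed.
    return token.strip(string.punctuation).lower()


def merge_line_fragments(lines, lexicon):
    """Join the last token of a line with the first token of the next
    line when the concatenation is a known word and at least one of the
    two halves is not."""
    rows = [line.split() for line in lines]
    for cur, nxt in zip(rows, rows[1:]):
        if not cur or not nxt:
            continue
        joined = cur[-1] + nxt[0]
        halves_known = _key(cur[-1]) in lexicon and _key(nxt[0]) in lexicon
        if _key(joined) in lexicon and not halves_known:
            cur[-1] = joined
            del nxt[0]          # rows[1:] shares these lists, so this sticks
    return [" ".join(row) for row in rows]


def correct_token(token, lexicon, counts, max_count=2, cutoff=0.8):
    """Replace a rare, lower-case, out-of-lexicon token by its closest
    lexicon entry, if one is similar enough; otherwise leave it alone."""
    if (not token.isalpha()
            or token[0].isupper()               # probably a named entity
            or token.lower() in lexicon
            or counts.get(token, 0) > max_count):   # frequent: likely real
        return token
    matches = difflib.get_close_matches(token, sorted(lexicon),
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else token


lexicon = {"the", "fragmentation", "of", "words", "corpus", "reliability"}
merged = merge_line_fragments(["the frag", "mentation of words"], lexicon)
fixed = correct_token("corpns", lexicon, counts={"corpns": 1})
```

Requiring that at least one half be unknown keeps the merge step from
gluing together two legitimate words that merely happen to form a third,
at the cost of missing splits such as "bound" + "aries" where the first
half is itself a word.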
> >
> > Best,
> > Yannick Versley
> >
> > --
> > Yannick Versley
> > Seminar für Sprachwissenschaft, Abt. Computerlinguistik
> > Wilhelmstr. 19, 72074 Tübingen
> > Tel.: (07071) 29 77352
> >
> >
>
>
> --
> Ed Kenschaft
> Ph.D. student, Computational Linguistics, University of Maryland
> ekenschaft at gmail.com
> www.umiacs.umd.edu/users/kensch/