[Corpora-List] 'imperfect' corpora

Thu Nov 16 08:48:15 UTC 2006

Hi,

> I have been given access to a large amount of data, which has been OCR'd
> into a digital (.txt file) format. The data is extremely valuable for a
> number of reasons and I would like to carry out, amongst other things, a
> Keyword analysis. However, test-runs with corpus investigation tools show
> that there are a few problems with the reliability of the corpus due to OCR
> errors (mis-copying and fragmentation of words over end-of-line boundaries,
> etc.).
I think it may be worth trying to correct the most blatant of these errors
(semi-)automatically. For example, you could merge word fragments that are
split across a line break, or (assuming that errors are rare in proportion
to the rest of the text) replace rare words that do not occur in a
dictionary or another known-good word list, and that are not capitalized
(capitalized words are likely to be named entities), with the nearest word
that may be the correct spelling.
Of course, a good deal of guesswork is involved here, but if you are aiming
for a keyword analysis, you have a better chance of getting usable results
if you correct errors using a moderate amount of linguistic knowledge than
if you just try to live with the noisy data.
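
To make the two heuristics a bit more concrete, here is a rough Python
sketch. Everything in it is an assumption for illustration only: the
function names, the thresholds (max_count, cutoff), and the use of difflib's
similarity ratio as a stand-in for "nearest word" (a proper edit distance,
or one weighted by typical OCR confusions, may well work better). The
known-good word list is assumed to be a set of lowercase words.

```python
import difflib
import re
from collections import Counter


def merge_split_words(lines, known_words):
    """Rejoin words that were split across a line break.

    A merge happens only if the joined form is in known_words and either
    the line ended in a hyphen or one of the fragments is not a word itself.
    """
    lines = list(lines)
    for i in range(len(lines) - 1):
        head = re.search(r"([A-Za-z]+)(-?)\s*$", lines[i])
        tail = re.match(r"\s*([A-Za-z]+)", lines[i + 1])
        if not head or not tail:
            continue
        left, hyphen, right = head.group(1), head.group(2), tail.group(1)
        joined = left + right
        fragment_unknown = (left.lower() not in known_words
                            or right.lower() not in known_words)
        if joined.lower() in known_words and (hyphen or fragment_unknown):
            # Move the second fragment up to the end of the first line.
            lines[i] = lines[i][:head.start()] + joined
            lines[i + 1] = lines[i + 1][tail.end():].lstrip()
    return lines


def correct_rare_words(tokens, known_words, max_count=1, cutoff=0.8):
    """Replace rare, unknown, all-lowercase tokens with a close known word.

    Tokens containing uppercase letters are left alone, since they are
    likely to be named entities. max_count and cutoff are guesses that
    would need tuning on real data.
    """
    counts = Counter(tokens)
    vocabulary = sorted(known_words)
    corrected = []
    for token in tokens:
        is_candidate = (token.isalpha() and token.islower()
                        and token not in known_words
                        and counts[token] <= max_count)
        if is_candidate:
            matches = difflib.get_close_matches(token, vocabulary,
                                                n=1, cutoff=cutoff)
            if matches:
                token = matches[0]
        corrected.append(token)
    return corrected


# Toy word list and input, invented for this example.
sample_words = {"the", "corpus", "is", "valuable", "for", "keyword",
                "analysis", "able"}
sample_lines = ["the corpvs is valu-", "able for keyword ana", "lysis"]

merged = merge_split_words(sample_lines, sample_words)
print(correct_rare_words(" ".join(merged).split(), sample_words))
```

It would be wise to log every replacement and look through the list by
hand, since a correction that is wrong but systematic would distort the
keyword counts just as much as the original OCR errors.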

Best,
Yannick Versley

-- 
Yannick Versley
Seminar für Sprachwissenschaft, Abt. Computerlinguistik
Wilhelmstr. 19, 72074 Tübingen
Tel.: (07071) 29 77352