Scanning older documents to text with OCR
Joana Jansen
jjansen at UOREGON.EDU
Thu Jul 22 18:59:08 UTC 2010
Dear RNLD folks,
I have a question about scanning older linguistics materials to text. The originals are clear, and the program I was using (OmniPage 15.2) does a good job of recognizing the English alphabet. However, the program does not recognize all of the characters used (barred l, schwa, small caps, some diacritics, etc.). The computer had fonts installed that contain these characters, though I am not sure whether that is relevant. As far as I can tell, and as far as tech support here could tell, there was no way for the program to ‘learn’ these characters; they had to be fixed by hand each time. It seemed to me that no time would be saved by using it, since fixing the scan output would take as long as typing in the texts.
There are newer versions of this program. OmniPage 17 claims ‘individual character training’, but I don’t know whether this applies to non-standard characters. (The website specifically says it will handle Chinese, Japanese, and Korean.)
I am wondering if list members have any suggestions about optical character recognition (OCR) technology for scanning materials with characters outside the English alphabet. Is there anything out there that will shorten the process of getting these documents digitized? I do not need the output to look exactly like the original; for example, if I could train a program to insert a capital E wherever it sees a schwa, that would work.
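Once a program can recognize the special characters at all, the substitution step itself could be handled by a small script. Below is a minimal sketch in Python, assuming the OCR output is saved as UTF-8 plain text; the file names and the character-to-replacement mapping are placeholders to adjust, not anything a particular OCR package provides:

# A minimal sketch: replace special characters in OCR output with
# chosen stand-ins (e.g. schwa -> capital E, as described above).
# The codepoints and file names below are assumptions; adjust them
# to match the characters that actually appear in the scans.

CHAR_MAP = {
    "\u0259": "E",   # schwa (ə) -> capital E
    "\u0142": "l",   # barred l (ł) -> plain l (assumed substitution)
    "\u0301": "",    # combining acute accent dropped (assumed)
}

def normalize(text: str) -> str:
    """Replace each mapped character with its substitute."""
    return text.translate(str.maketrans(CHAR_MAP))

with open("ocr_output.txt", encoding="utf-8") as src:
    cleaned = normalize(src.read())

with open("ocr_output_clean.txt", "w", encoding="utf-8") as dst:
    dst.write(cleaned)

The mapping table could then be extended one character at a time as new symbols turn up in the scans.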
Many thanks,
Joana
--
Joana Jansen
jjansen at uoregon.edu
Northwest Indian Language Institute and
Department of Linguistics, University of Oregon
1629 Moss Street
Eugene OR 97403
phone: 541-346-0730
fax: 541-346-0686