OCR advice?

Yoshimasa Tsuji yamato at YT.CACHE.WASEDA.AC.JP
Wed Feb 2 00:46:24 UTC 2000


Hello,
There are only two OCR packages worth considering on the market:
FineReader and CuneiForm. FineReader is the more expensive, being
the winner, and CuneiForm has become much cheaper, being the loser.
Both are good enough, but everyone agrees that FineReader is the
best. As a matter of fact, you simply have no choice. So using
e.g. MacTiger on your Mac, or some other obscure OCR software, is
a sheer waste of time unless you are going to scan only a very
small amount of data.

Modern OCR can do almost everything you want it to do: scan pages,
convert images (rotation, mirror image, b/w reversal, editing, etc.),
import/export images from/to all sorts of formats (JPEG, GIF, TIFF, etc.),
recognition and proofreading of some 30 languages, trainability (you can
teach it Glagolitic glyphs, for example), recognition of handwriting,
recognition of pro-forma data such as questionnaire results, export to all
sorts of document formats (MS Word, MS Excel, HTML, etc., preserving
drawings and pictures in the right position). If you scan Volume 1 of
Polnoe Sobranie Sochinenij F. Dostoevskogo published in the 1970s, you
will need to fix a word or two for every three pages. If you organize your
work properly, it will take only an hour to recognize and proofread the
whole of that Volume 1. That is surely much faster than a professional
typist.

However, there is a catch in this account. Decisive factor No. 1 is the
quality of the scanner: high-end scanners like Fujitsu or Ricoh will be
fine, but low-end scanners will take ages to scan a whole book (good ones
scan 30 to 80 pages a minute, while poor ones manage four pages or fewer
and often lack a sheet feeder). There is absolutely no need for a colour
scanner or a high-resolution scanner (600 dpi is the most you will need).
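To see what those feed rates mean for a whole book, here is a rough
sketch in Python. The 500-page book length and the function name are
assumptions made purely for illustration; only the pages-per-minute
figures come from the text above.

```python
def scan_minutes(pages: int, pages_per_minute: float) -> float:
    """Minutes needed to scan `pages` sheets at a steady feed rate."""
    if pages_per_minute <= 0:
        raise ValueError("pages_per_minute must be positive")
    return pages / pages_per_minute


BOOK_PAGES = 500  # assumed book length, for illustration only

# Feed rates quoted above: good scanners 30 to 80 ppm, poor ones about 4 ppm.
for label, rate in [("good (slow end)", 30), ("good (fast end)", 80), ("poor", 4)]:
    print(f"{label:>16}: {scan_minutes(BOOK_PAGES, rate):6.1f} min")
```

Under these assumptions the poor scanner needs over two hours of
feeding, against well under twenty minutes for a good one, and that is
before counting manual page turning when there is no sheet feeder.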
  Decisive factor No. 2 is the quality of the scanner's device driver:
if it does not let you set the darkness threshold properly, you will
have a hard time. If it smooths the glyph contours, etc., you will be
very comfortable. High-end scanners usually come with excellent
scanning software; use it instead of the scanning module built into the
OCR program.
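The darkness threshold decides which grey pixels become ink and which
become paper. A minimal sketch in Python, assuming an 8-bit greyscale
image stored as a list of rows (0 = black, 255 = white); the function
name and the sample values are hypothetical and not taken from any
scanner driver:

```python
def binarize(gray, threshold):
    """Return a 1-bit image: 1 (ink) where a pixel is darker than threshold."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


# A faint stroke (values 140-150) next to a solid one (25-30) on light paper.
sample = [
    [250, 140, 245, 30, 250],
    [248, 150, 240, 25, 252],
]

for t in (100, 160):
    print(f"threshold={t}")
    for row in binarize(sample, t):
        # '#' marks ink, '.' marks paper
        print("".join("#" if bit else "." for bit in row))
```

With the lower threshold the faint stroke disappears; with the higher
one it survives, at the risk of turning paper noise into ink. That is
why a driver that does not let you tune this value is so painful.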
  There is still a lot left to be desired, e.g. FineReader cannot
properly read a bilingual text where the left page is in English
and the right page in Russian: all you can do is set the working language
to English&Russian. Then FR will try to find English glyphs on the wrong
page and make mistakes.

  Lastly, I advise you not to be impressed by a recognition rate of
99%. It means the OCR will make one mistake in every hundred symbols,
including white spaces, punctuation marks, etc. It usually means that
you need to edit roughly one word per line!! In order to work
comfortably you must prepare an ultra-ultra beautiful image file.
  If the scanned image is poor, you will have the nightmare of checking
every one-letter word (a, v, i, k, o, s, u, ja) almost forever.
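To put the 99% figure into numbers, here is a small sketch in Python.
The 65 symbols per line and 40 lines per page are assumptions of mine,
not measurements, and errors are treated as independent, which real OCR
errors are not exactly.

```python
def expected_errors(chars: int, accuracy: float) -> float:
    """Expected number of misrecognized symbols among `chars` symbols."""
    return chars * (1.0 - accuracy)


def clean_line_probability(chars: int, accuracy: float) -> float:
    """Chance that a line has no error at all, assuming independent errors."""
    return accuracy ** chars


CHARS_PER_LINE = 65  # assumed, including spaces and punctuation
LINES_PER_PAGE = 40  # assumed

for acc in (0.99, 0.999):
    per_line = expected_errors(CHARS_PER_LINE, acc)
    per_page = per_line * LINES_PER_PAGE
    clean = clean_line_probability(CHARS_PER_LINE, acc)
    print(f"accuracy {acc}: {per_line:.3f} errors/line, "
          f"{per_page:.1f} errors/page, {clean:.0%} of lines clean")
```

Under these assumptions a 99% rate gives about 0.65 expected errors per
line, roughly 26 per page, and only about half of the lines come out
clean.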

Cheers,
Tsuji

-------------------------------------------------------------------------
 Use your web browser to search the archives, control your subscription
  options, and more.  Visit and bookmark the SEELANGS Web Interface at:
                http://members.home.net/lists/seelangs/
-------------------------------------------------------------------------