OCR advice?

James Partridge james.partridge at ST-EDMUND-HALL.OXFORD.AC.UK
Wed Feb 2 11:11:41 UTC 2000


Michael
One of the biggest OCR companies, and the one that makes most fuss about its
multilanguage support, is ScanSoft (http://www.textbridge.com). I have a
copy of their TextBridge Pro 9.0 OCR software which claims support for some
58 languages, including all Cyrillic and Central European languages. My
colleague James Naughton uses the same program.

We primarily use TB9.0 for scanning Czech poetry - quite close to what
you're proposing - although occasionally we've tried it with other Central
European languages and the odd page or two of Russian. Our experience is
generally very good, although the product doesn't quite live up to its
claims. For the sake of fairness, though, I should add that I'm on the TB
mailing list and I've rarely seen so much dislike of a product from so many
people. Mind you, a lot of the people who complain that they are completely
unable to work with it are also completely unable to understand the simple
one-line instruction for signing off the mailing list, so perhaps they're
not the best judges.

Anyway, after quite a bit of experimenting we have found ways to make TB9.0
scan Czech poems into various file formats with a pretty high accuracy rate.
This can be improved, by the way, by "teaching" the OCR software -
particularly helpful if you're working a lot with a particular font. It is
not perfect, but I've scanned tens of pages of poetry with an hour or two's
work - an awful lot faster than you could type it. I know this because prior
to getting TB9.0 (it's rather a new product) I did type about 5000 lines of
poetry into my computer - something that took me many months. I now know
that I could have achieved the same result in a couple of days with TB9.0.

As I say, though, TB9.0 doesn't quite meet its specifications. Scanning
directly into Word is possible - we've done it into both Word97 and Word
2000 - but some languages work better than others. In our case, there is a
problem with handling Czech characters in Word, although other languages
we've tried (such as Polish) have worked OK. We've been in touch with
ScanSoft about this.

Ironically, the best results are obtained by scanning directly into HTML -
this tends to go very smoothly. In fact, when we started using TB9.0 we
scanned into HTML first, then put that text by devious means into Word. The
very best route, though, is to go into a plain text editor, assuming your
plain text editor can manage the particular language. Fortunately, the
brilliant NoteTab (http://www.notetab.com/) can manage CE, Baltic, Cyrillic
and pretty much any other language you care to think of, so that's not a
problem. You can also edit your texts in NoteTab and convert them directly
into HTML, so that's another advantage.
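
As a rough illustration of the encoding side of this (a sketch only, not
anything TextBridge or NoteTab does internally): once the text is in a
plain-text file, moving it from one code page to another is a short job in a
scripting language. The function name and the default code pages below
(Windows-1251 in, KOI8-R out) are assumptions chosen for the example; for
Czech you would use something like "cp1250" and "iso8859_2" instead.

```python
def reencode(data: bytes, source: str = "cp1251", target: str = "koi8_r") -> bytes:
    """Decode bytes from the source code page and re-encode them in the target.

    Both names must be codecs Python knows about. A character that does not
    exist in the target code page raises UnicodeEncodeError rather than
    being silently dropped.
    """
    text = data.decode(source)
    return text.encode(target)


# Cyrillic: Windows-1251 bytes converted to KOI8-R
cyrillic = "Пушкин".encode("cp1251")
print(reencode(cyrillic).decode("koi8_r"))

# Czech: Windows-1250 bytes converted to ISO 8859-2
czech = "Příliš žluťoučký kůň".encode("cp1250")
print(reencode(czech, "cp1250", "iso8859_2").decode("iso8859_2"))
```

The conversion only works if you know which code page the OCR output was
saved in; guessing wrong produces readable-looking nonsense rather than an
error.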

It did take me several days to work out how to deal with poetry. Scanners
tend to see poems as prose with inexplicably large white spaces between
words (not a bad way to see them, actually) so your page of Tyutchev or
Pushkin tends to get turned into a short paragraph. Once you understand the
principles of zoning and marking text, however, this problem generally goes
away.
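
The same line-break problem comes back when the plain text is turned into
HTML, because browsers collapse line breaks just as the scanner does. A
minimal sketch of one way round it, assuming the OCR output has one verse
line per text line and blank lines between stanzas (the function name
poem_to_html is made up for this example):

```python
import html


def poem_to_html(text: str) -> str:
    """Turn a plain-text poem into HTML that keeps its line structure.

    Blank lines separate stanzas. Each stanza becomes one <p> element and
    the lines inside it are joined with <br>.
    """
    stanzas = []
    current = []
    for line in text.splitlines():
        if line.strip():
            # Escape &, < and > so stray characters cannot break the markup
            current.append(html.escape(line.rstrip()))
        elif current:
            # A blank line closes the stanza collected so far
            stanzas.append(current)
            current = []
    if current:
        stanzas.append(current)
    return "\n".join("<p>" + "<br>\n".join(s) + "</p>" for s in stanzas)


poem = "First line\nSecond line\n\nNew stanza"
print(poem_to_html(poem))
```

Indentation at the start of a line survives in the output string but will
still be collapsed by a browser, so indented verse needs extra handling.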

I must say that TB9.0 does have some rough edges, and it was a big advantage
for us that we both work a lot with computers so were able to devise various
workarounds (some of them rather arcane) for stupid problems. Despite this,
it works pretty well for us and is a vast improvement on typing the stuff
out by hand. TB9.0 is not cheap though - £60 in the UK - and as I say, I
have read a lot of negative feedback about it.

I have recently heard good things about a Russian OCR package called ABBYY
FineReader (http://www.abbyy.com/) which looks rather interesting. The fact
that it's a Russian product tends to suggest that it should cope with
Cyrillic pretty well, and it claims another 52 languages as well as
spell-checking for a whole pile of them. Best of all, you can download a
demo of the entire thing (from:
http://www.abbyy.com/products/fine/down/license.htm) although it is a pretty
hefty size. I haven't tried this myself yet, but their site looks very
professional, they seem to have excellent reviews, and I certainly plan to
check them out asap.

I hope that's useful info for you. If I can help any more please let me
know.

James

James Partridge
St Edmund Hall
Oxford
[Central Europe Review: http://www.ce-review.org]


----- Original Message -----
From: "Michael A. Denner" <    >
To: <SEELANGS at LISTSERV.CUNY.EDU>
Sent: Tuesday, February 01, 2000 11:22 PM
Subject: OCR advice?


> I'm about to begin a project that will involve converting 200+ pages of
> Cyrillic text into something that will eventually be in HTML format. Since
> there are many tech-savvy people and companies that read this usenet, I
> thought I'd start here.
>
> I'm looking for any recommendations for OCR software: The text that needs
> to be scanned is clear & fairly homogeneous, but it's poetry, so formatting
> is a complicated affair. Since this will eventually be used in HTML
> documents, the scanner should convert the text into (I think) KOI-8,
> preferably to other formats as well (like the MAC- or PC-related codes for
> HTML editing in Cyrillic). Ideally, it should scan directly into Microsoft
> Word, since I've had good luck converting Cyrillic documents from Word to
> DreamWeaver (the HTML editor I use).
>
> Has anyone had any experience with OCR technology? Any problems using the
> data in HTML format? Does Microsoft have integrated software to use with
> Cyrillic? Any and all advice appreciated. Please respond off list, unless
> you believe that your response will be of general interest.
>
> Michael A. Denner
> Northwestern University
>
>
> +++***+++
> the preacher should shout... with thundering voice: "'pause, avast, why so
> seeming fast, but deadly slow?'"
> thoreau. walden. 1854.
>
> -------------------------------------------------------------------------
>  Use your web browser to search the archives, control your subscription
>   options, and more.  Visit and bookmark the SEELANGS Web Interface at:
>                 http://members.home.net/lists/seelangs/
> -------------------------------------------------------------------------
>



