[RNLD] OCR

Fri Dec 6 09:55:06 UTC 2019

Hi Nick,

I agree with Paul, Tesseract is worth looking into.

For context, OCR technology has changed quite a lot in recent years.
Since version 4, Tesseract uses line-based training data: the model is
trained on line images paired with the corresponding text. So it doesn't
learn isolated characters, but uses the whole line and its content. I
don't think it is possible to teach any of these systems superscript at
the moment, as they learn from plain character strings with no formatting,
but mapping superscript w to some arbitrary placeholder character that
keeps it distinct should work really well. You can get the output as XML
anyway, so you can post-process it afterwards.
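
A minimal sketch of that placeholder idea in Python. The choice of "§" as
the placeholder is purely an assumption for illustration; any character
that never occurs in the dictionary would do. The function names are
hypothetical as well.

```python
# Sketch only: PLACEHOLDER is an arbitrary choice, assumed to be absent
# from the dictionary text. Pick whatever is safe for your material.
SUPERSCRIPT_W = "\u02b7"  # MODIFIER LETTER SMALL W
PLACEHOLDER = "\u00a7"    # SECTION SIGN


def encode_ground_truth(line: str) -> str:
    """Replace superscript w with the placeholder before training."""
    if PLACEHOLDER in line:
        # The mapping would be ambiguous if the placeholder is already used.
        raise ValueError("placeholder already occurs in: " + line)
    return line.replace(SUPERSCRIPT_W, PLACEHOLDER)


def decode_ocr_output(line: str) -> str:
    """Turn the placeholder back into superscript w after recognition."""
    return line.replace(PLACEHOLDER, SUPERSCRIPT_W)


if __name__ == "__main__":
    original = "k\u02b7a \u014ba"
    encoded = encode_ground_truth(original)
    print(encoded)
    print(decode_ocr_output(encoded) == original)
```

The same decoding step could just as well emit markup for superscript
instead of a single character, depending on what the new dictionary
version needs.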

The last time I trained Tesseract, I used this:

https://github.com/tesseract-ocr/tesstrain

There are also other systems, e.g. Calamari, which is actively developed
and with which I have had really good experiences:

https://github.com/Calamari-OCR/calamari

What makes Tesseract more practical at the moment is that it also does
layout detection. If Tesseract gives you decent layout and line detection
results, then continuing to work with that is an option. With Calamari and
others, one has to build somewhat more complex pipelines that run
different tools.
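
To judge the line detection, it can help to list the detected lines with
their bounding boxes. Below is a rough sketch, assuming the page was
exported as hOCR (for example with `tesseract page.png page hocr`, where
the file names are placeholders) and that lines are marked with the class
`ocr_line`, as in the hOCR format. The helper names are hypothetical, and
other line-level classes may need to be added for real output.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for field in title.split(";"):
        parts = field.split()
        if parts and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:5])
    return None


class LineBoxParser(HTMLParser):
    """Collect (bbox, text) for every span with class ocr_line."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._depth = 0  # span nesting inside the current line, 0 = outside
        self._bbox = None
        self._words = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        if self._depth:
            # A nested span (typically a word) inside the current line.
            self._depth += 1
        elif "ocr_line" in (attrs.get("class") or "").split():
            self._depth = 1
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._words = []

    def handle_endtag(self, tag):
        if tag != "span" or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # The line span itself has been closed.
            self.lines.append((self._bbox, " ".join(self._words)))

    def handle_data(self, data):
        if self._depth and data.strip():
            self._words.append(data.strip())


def extract_lines(hocr):
    """Return a list of (bbox, text) tuples for the lines in an hOCR string."""
    parser = LineBoxParser()
    parser.feed(hocr)
    parser.close()
    return parser.lines
```

Each returned box could then be used to crop a line image from the page
scan, for checking the segmentation by eye or for passing lines on to
another recognizer.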

The problem that remains is that these tools aren't very good for
proofreading the text and manually correcting mistakes in layout
detection. For this reason I would suggest that you look into Transkribus
as well. It is mainly open source, but some components are not publicly
available. In any case, it isn't really a commercial tool in the typical
sense either. It has really good interfaces, both the Transkribus program
and the web editor, which make collaborative editing of materials really
easy.

The tool is designed for handwritten text, but it works extremely well
with printed text too.

https://transkribus.eu/

To train Transkribus models you need to contact the developers so that
they add training rights to your profile.

I'm involved in several projects around OCR and HTR, so if further
questions come up, I probably have some ideas and examples to share. Good
luck with your OCR task!

Best wishes,

Niko

On Fri, Dec 6, 2019 at 9:18 AM Trilsbeek, Paul <Paul.Trilsbeek at mpi.nl>
wrote:

> Hi Nick,
>
> Perhaps you could give Tesseract a try. No idea whether that would do any
> better, but it seems to be used a lot nowadays and it's free and open
> source.
>
> https://github.com/tesseract-ocr/tesseract
>
> Best,
>
> Paul
>
> On 5 Dec 2019, at 21:16, Nick Thieberger <thien at unimelb.edu.au> wrote:
>
> Has anyone had experience of successful OCR of ŋ and superscript w? I have
> tried in ABBYY and OmniPage with no success. This is to produce a new
> version of an existing print dictionary for which we have a PDF.
>
> Thanks,
>
> Nick
>
>
>