Quote: Everything should be as simple as it can be, but not simpler (antedating attrib Albert Einstein 1950)

Sun Feb 28 22:57:24 UTC 2010

Many thanks to Fred Shapiro for his compliment. I hope that the Yale
Book of Quotations and the forthcoming Yale Book of Modern Proverbs
will appear online in the future. The Oxford Reference Online contains
several quotation, proverb, and idiom reference books, and it is quite
convenient. An online edition would also allow continuous incremental
improvement; witness the OED.

Victor Steinbok wrote:
> A lot of the problems come from manual entries, not just from OCR. Does
> anyone know how Google can be contacted directly with a specific
> suggestion to have a repository for correction?

One way to contact the Google Books people is through their blog,
Inside Google Books. The "contact us" link in the right column
leads to a form for feedback and suggestions. There is a separate form
for "reporting issues with specific content", e.g., "Error with
bibliographic information", "Books merged together", and "Bad image
scan".

http://booksearch.blogspot.com/

Robin Hamilton asked:
> Does anyone know what OCR software googlebooks uses?  Seems a bit primitive
> to me.

I think Google uses its own homebrew optical character recognition
(OCR) system, which it is trying to improve.

The official Google blog ran a story on 2009-09-16 about the
acquisition of the company reCAPTCHA. A CAPTCHA is a
challenge-response test used to screen access to systems and to
block automated, non-human agents. One type of CAPTCHA displays
distorted text and asks the test-taker to type in the characters
shown.

The blog posting says that Google hopes to use CAPTCHAs to help train
its optical character recognition system. The displayed text will
"come from scanned archival newspapers and old books. Computers find
it hard to recognize these words because the ink and paper have
degraded over time, but by typing them in as a CAPTCHA, crowds teach
computers to read the scanned text."

http://googleblog.blogspot.com/2009/09/teaching-computers-to-read-google.html
http://recaptcha.net/
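
To make the "crowds teach computers" idea concrete, here is a minimal
sketch in Python of how several typed answers for one scanned word
might be combined by majority vote. This is only an illustration: the
function name and both thresholds are my own assumptions, and I do not
know how Google or reCAPTCHA actually reconcile answers.

```python
from collections import Counter


def consensus_transcription(typed_answers, min_votes=3, min_share=0.5):
    """Return the agreed reading of one scanned word, or None if undecided.

    Hypothetical helper: min_votes and min_share are illustrative
    thresholds, not values taken from any real system.
    """
    # Normalize case and surrounding spaces so trivial differences
    # do not split the vote; skip empty answers.
    votes = Counter(a.strip().lower() for a in typed_answers if a.strip())
    total = sum(votes.values())
    if total < min_votes:
        return None  # too few answers to trust any reading
    word, count = votes.most_common(1)[0]
    # Accept the most common reading only if enough typists agree on it.
    return word if count / total >= min_share else None


answers = ["simpler", "Simpler", "simp1er", "simpler ", "sirnpler"]
print(consensus_transcription(answers))
```

A word accepted this way could then be stored as the corrected text
for that spot in the scan, while words without agreement would be
shown to more people.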

Google Code has a web page about the Tesseract OCR engine, which is an
open source project. Google also sponsors the development of the
OCRopus document analysis and OCR system. I do not know whether this
code is used by Google itself.

http://code.google.com/p/tesseract-ocr/
http://code.google.com/p/ocropus/
http://en.wikipedia.org/wiki/OCRopus

Garson

------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org