Google's metadata mish-mash

Mon Dec 20 07:36:51 UTC 2010

Sometimes the dates actually follow a fairly predictable pattern.
Serials and government documents often get dated from the date of
volume 1--not just the first issue in the scanned volume. If there is
any art or facsimile reproduction of an earlier edition in the front
matter, the date metatag is often picked up from this piece rather than
the copyright date. Finally, 19th century title pages often get 6s and
9s misread as 0s, resulting in 1860 and 1899 being misread as 1800. Some
8s and 3s also get misinterpreted. Similarly, in many cases,
particularly with publications from 1905-1920, the 9s get read as 8s
and, in the 1870s, the 7s get read as 1s. Other date errors are
introduced by reprints, and, in particular, by forewords and commentary
on facsimile editions. Thus, for example, one may get words coined in
the 20th century attributed to 18th century editions of
Shakespeare.
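
The digit confusions above are regular enough to run backwards: given a
suspicious date, one can list the years it might really be. What follows
is only a rough sketch of that idea, not anything Google does. The names
READ_AS and candidate_years and the 1500-2010 window are made up for the
illustration, the 8/3 confusion is left out because its direction is not
stated, and the period restrictions mentioned above are ignored.

```python
from itertools import product

# Digit the OCR reports -> digits that may really be on the title page.
# Taken from the confusions described above: 6 and 9 read as 0,
# 9 read as 8, 7 read as 1. (Hypothetical table, for illustration only.)
READ_AS = {"0": ("6", "9"), "8": ("9",), "1": ("7",)}


def candidate_years(reported, earliest=1500, latest=2010):
    """Return the years that could have been misread as `reported`."""
    options = []
    for digit in str(reported):
        # Each digit is either correct as read or stands in for a misread one.
        options.append((digit,) + READ_AS.get(digit, ()))
    years = {int("".join(combo)) for combo in product(*options)}
    years.discard(reported)  # keep only the alternatives
    return sorted(y for y in years if earliest <= y <= latest)


if __name__ == "__main__":
    print(candidate_years(1800))
    print(candidate_years(1815))
```

A hit dated 1800 or 1815 whose vocabulary looks late is then a candidate
for one of the listed years, to be checked by hand against the scan.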

But that's not the main source of corruption in the reduced set of data
here. There is another source.  Google OCR appears to have been
optimized for 20th century fonts. It does a passable job interpreting
most of the 19th century materials, but chokes on books and serials
published before 1820 or so. And if there are italics, Gothic script and
long Ss--forget it! And even with individual words there are a lot of
problems. If the scanned word is not in the database that the OCR checks
for spelling, quite often the word gets cupertinoed. This was the case
with most late 19th century appearances of "biometrical", which Google
OCR, apparently expecting "biometric" as the normal form, instead
interpreted as "biomedical"--when I searched for "biomedical", 85% of
all hits between 1850 and 1920 were of this type (not that there were a
lot of hits--and they were concentrated around 1890). Overall, this may
introduce a fairly small, but significant, percentage of corruption into
the corpus as a whole, but for individual searches some of these errors
are devastating. I've had searches where, of 160-200 hits, every single
one was there because of either a corrupt date or an OCR error.

Another item that I found a couple of days ago adds another twist. UC
libraries have fairly substantial collections of Indian materials from
1940-1970, particularly in several sciences. I witnessed this in person
when I was manually looking for particular materials in the mathematics
library in Berkeley, but there is plenty of evidence of this in GB as
well. The problem is that, for some reason, GB completely screws up the
dates on these materials, and in an unpredictable manner. In the
aforementioned
search for "biomedical", most of the hits that did not actually read
"biometrical" were publications of this type, misdated by about 50-60
years. I have no idea what contributed to this.

All this makes for an annoyance for individual searches, but, if it is
not accounted for, it can throw a big monkey wrench into meta-analysis.

     VS-)

On 12/20/2010 12:32 AM, geoffrey nunberg wrote:
> Google Books is strewn with metadata errors. For this corpus, they made an effort to reduce some persistent problems by eliminating serials (for which it often happens that all articles bear the date of the first number) and via some other expedients, but hand checks reveal that the overall misdating rate (> 5 years) is still 5.8 percent, and it's higher than that for earlier years (see the discussion in the appendix of the Science paper). If you see an implausible spike in hits before 1800, this is very likely the source. For example there are 20 or so hits in the corpus for 'patriotism' pre-1700, and every one of them a metadata error.
>
> For more, see my article in the Chronicle of Higher Education, Aug. 31, 2009.
> http://chronicle.com/article/Googles-Book-Search-A/48245/
>
> Geoff

------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org