Google's metadata mish-mash

geoffrey nunberg nunberg at ISCHOOL.BERKELEY.EDU
Mon Dec 20 05:32:03 UTC 2010

Google Books is strewn with metadata errors. For this corpus, they made an effort to reduce some persistent problems by eliminating serials (for which it often happens that all articles bear the date of the first number) and via some other expedients, but hand checks reveal that the overall misdating rate (dates off by more than 5 years) is still 5.8 percent, and it's higher than that for earlier years (see the discussion in the appendix of the Science paper). If you see an implausible spike in hits before 1800, misdating is very likely the source. For example, there are 20 or so hits in the corpus for 'patriotism' pre-1700, and every one of them is a metadata error.

For more, see my article in the Chronicle of Higher Education, Aug. 31, 2009.


> From: Gerald Cohen <gcohen at MST.EDU>
> Date: December 19, 2010 11:13:33 AM PST
> Subject: Re: Linguistic dark matter - search for pre-1843 "shyster"
> Thanks, Fred. I just checked Google books for pre-1843 "shyster", and
> Google's new searchable database clearly turns out to have inaccuracies when
> it indicates that "shyster" had limited use around 1800 and 1820.
> (My research indicates the term actually arose in 1843).
> Two examples:
> 1) Google books has a "shyster" quote from 1801 (Joseph Bushnell Grinnell,
> _Men and Events of Forty Years_, p. 409). But Grinnell wasn't even born
> in 1801 (his dates are 1821 - 1891), and his book was published in the
> year of his death.
> 2) Google has another 1801 "shyster" attestation: T. Dewitt Talmadge,
> _Night Scenes of City Life_, p. 121.  But Talmadge's dates are (1832 -
> 1902), and WorldCat says the book was published in 1891.
> I found another supposed pre-1843 attestation of "shyster", but when I
> checked the page, the spelling was "Thyster" and had nothing to do with
>
> Oy vey!
>
> Gerald Cohen
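
The hand check in the two examples above (comparing the date Google assigns to a book against the author's lifespan) can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: the Record type, its field names, and the flag_misdated function are invented for illustration and do not correspond to any Google Books interface; the first two records use the dates given in the message above, and the third is a made-up control. Posthumous publication is legitimate, so only a claimed date earlier than the author's birth is flagged.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Record:
    # Hypothetical record layout, not a real Google Books schema.
    title: str
    author: str
    claimed_year: int           # publication date given in the metadata
    author_born: Optional[int]  # None when the birth year is unknown


def flag_misdated(records: List[Record]) -> List[Tuple[Record, str]]:
    """Return (record, reason) pairs for books dated before their author's birth."""
    flagged = []
    for rec in records:
        if rec.author_born is not None and rec.claimed_year < rec.author_born:
            reason = f"dated {rec.claimed_year}, but author born {rec.author_born}"
            flagged.append((rec, reason))
    return flagged


RECORDS = [
    # Dates as reported in the message above.
    Record("Men and Events of Forty Years", "Joseph Bushnell Grinnell", 1801, 1821),
    Record("Night Scenes of City Life", "T. Dewitt Talmadge", 1801, 1832),
    # Made-up control record with a plausible date.
    Record("(hypothetical control)", "A. N. Author", 1850, 1800),
]

for rec, reason in flag_misdated(RECORDS):
    print(f"{rec.author}, {rec.title}: {reason}")
```

A check like this catches only the grossest errors. A book misdated to a year within its author's lifetime passes, so it gives a lower bound on the misdating rate, not an estimate of it.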

The American Dialect Society -
