The perils of ngrams (was Usage Ridicule)

Geoffrey Nunberg nunberg at ISCHOOL.BERKELEY.EDU
Sat Mar 31 16:18:28 UTC 2012


> From: Victor Steinbok <aardvark66 at GMAIL.COM>
> Date: March 28, 2012 6:57:15 PM HST
> Subject: Re: Fwd: usage ridicule
> 
> Aside from the fact that the graphs are in no way similar, there is the
> problem that restricting to only British English duplicates most of the
> picture, but not quite. Until the WWII years, "a historical" and "an
> historical" behave similarly. Both start a mild decline in the 1890s,
> but the decline is much shorter than the overall graph, continuing
> increasing precipitously from the late 1950s.
> 
> http://goo.gl/2BUfi

I've written elsewhere about how dreadful the Google Books metadata are (see http://chronicle.com/article/Googles-Book-Search-A/48245/ and http://bit.ly/H55iDu). The difficulties carry over to the ngrams tool in a number of ways. One particularly egregious example is the "British English" corpus. Of the hits for "a historical" that come up in this search of the "British English" corpus, the vast majority are American publications, and so, too, are around half the books that come up for "an historical."

I've been collecting examples of just how terrible a job Google Labs did on this tool, and when I get around to it I'll do a LanguageLog post enumerating all the problems. These are not going to be fixed; Google Labs has entirely abandoned the tool after getting itself a flurry of publicity when it was first released.

Geoff
------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org


