The perils of ngrams (was Usage Ridicule)

Joel S. Berson Berson at ATT.NET
Sat Mar 31 18:09:57 UTC 2012


Would you say that gross evidence of trends in Ngram has some
validity, if one avoids attempted refinements such as British vs.
American corpuses (corpi?)?

For example, elsewhere I graphed "years old" vs. "years of his age",
and concluded "The two phrases track each other remarkably closely
(not just in relative increases of each, but in actual percentage of
appearance) from 1700 to about 1810, when 'years old' increases some
more and 'of his age' drops off significantly."  (With a caution
about what an ngram actually plots.)  My common sense tells me that a
drop-off of "of his age" is correct, and that it began around 1810 is
probably correct also.  Google's dating errors likely do not
significantly affect this.


At 3/31/2012 12:18 PM, Geoffrey Nunberg wrote:
> > From: Victor Steinbok <aardvark66 at GMAIL.COM>
> > Date: March 28, 2012 6:57:15 PM HST
> > Subject: Re: Fwd: usage ridicule
> >
> > Aside from the fact that the graphs are in no way similar, there is the
> > problem that restricting to only British English duplicates most of the
> > picture, but not quite. Until the WWII years, "a historical" and "an
> > historical" behave similarly. Both start a mild decline in the 1890s,
> > but the decline is much shorter than the overall graph, continuing
> > increasing precipitously from the late 1950s.
> >
> >
>I've written elsewhere about how dreadful the Google Books metadata
>are (see
>and The difficulties carry over to the ngrams
>tool in a number of ways. One particularly egregious eg is the
>"British English" corpus. Of the hits for "a historical" that come
>up in this search in the "British English" corpus, the vast majority
>are American publications, and so, too, are around half the books
>that come up for "an historical."
>I've been collecting eg's of just how terrible a job Google Labs did
>on this tool, and when I get around to it I'll do a LanguageLog post
>enumerating all the problems. These are not going to be fixed;
>Google Labs has entirely abandoned it after having gotten themselves
>a flurry of publicity when it was first released.
>The American Dialect Society -