The perils of ngrams (was Usage Ridicule)

Sat Mar 31 18:23:31 UTC 2012

Just as a point of clarification--I did not intend to make any grandiose
claims based on Ngrams in my original post. Quite to the contrary--I was
attempting to use the Ngrams to debunk an earlier claim based on them.

There are two different problems with Ngrams--one is of Google's
creation and that's the one Geoff and others have pointed out
repeatedly; the other comes from researchers who don't just ignore
warnings about the reliability of Ngram data, but introduce new errors by
failing to consider a broader set of data (not to mention errors in
logic, etc., but these are not methodological failures). Causal
conclusions should be particularly suspect.

However stated, my criticism was not of Joel, but of his correspondent
who made the Ngram suggestion.

     VS-)

On 3/31/2012 2:09 PM, Joel S. Berson wrote:
> Geoff,
>
> Would you say that gross evidence of trends in Ngram has some
> validity, if one avoids attempted refinements such as British vs.
> American corpuses (corpi?)?
>
> For example, elsewhere I graphed "years old" vs. "years of his age",
> and concluded "The two phrases track each other remarkably closely
> (not just in relative increases of each, but in actual percentage of
> appearance) from 1700 to about 1810, when "years old" increases some
> more and "of his age" drops off significantly."  (With a caution
> about what an ngram actually plots.)  My common sense tells me that a
> drop-off of "of his age" is correct, and that it began around 1810 is
> probably correct also.  Google's dating errors likely do not
> significantly affect this.
>
> Joel
>
> At 3/31/2012 12:18 PM, Geoffrey Nunberg wrote:
>>> From: Victor Steinbok<aardvark66 at GMAIL.COM>
>>> Date: March 28, 2012 6:57:15 PM HST
>>> Subject: Re: Fwd: usage ridicule
>>>
>>> Aside from the fact that the graphs are in no way similar, there is the
>>> problem that restricting to only British English duplicates most of the
>>> picture, but not quite. Until the WWII years, "a historical" and "an
>>> historical" behave similarly. Both start a mild decline in the 1890s,
>>> but the decline is much shorter than the overall graph, continuing
>>> increasing precipitously from the late 1950s.
>>>
>>> http://goo.gl/2BUfi
>> I've written elsewhere about how dreadful the Google Books metadata
>> are (see http://chronicle.com/article/Googles-Book-Search-A/48245/
>> and http://bit.ly/H55iDu). The difficulties carry over to the ngrams
>> tool in a number of ways. One particularly egregious eg is the
>> "British English" corpus. Of the hits for "a historical" that come
>> up in this search in the "British English" corpus, the vast majority
>> are American publications, and so, too, are around half the books
>> that come up for "an historical."
>>
>> I've been collecting eg's of just how terrible a job Google Labs did
>> on this tool, and when I get around to it I'll do a LanguageLog post
>> enumerating all the problems. These are not going to be fixed;
>> Google Labs has entirely abandoned it after having gotten themselves
>> a flurry of publicity when it was first released.
>>
>> Geoff

------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org