One happy language!

victor steinbok aardvark66 at GMAIL.COM
Wed Aug 31 21:53:25 UTC 2011


Wait! You mean 1 billion words from the NYT and 360 billion words from Google
Books boil down to 10000 unique words? How is that possible? It certainly sounds like
there might have been some selection bias--it was just better hidden than
merely picking up 10000 words from questionable sources. But, more to the
point, the result of the study reflects the scale bias of the
researchers--there is absolutely no indication of objectivity (nor is any
possible) in ranking the words. This is simply a classic error that creeps
into most social sciences--attaching an arbitrary scale to non-quantifiable
data will get you a neat numerical result, but one totally devoid of
actual meaning. Another recent classic in the same genre is UCLA Prof Tim
Groseclose's book Left Turn: How Liberal Media Bias Distorts the American
Mind (with Jeff Milyo), which Groseclose has been peddling for the last
couple of days as a guest blogger on Volokh Conspiracy. Geoff Nunberg took
the book apart on Language Log.

http://goo.gl/AOjOc

> But sand sifted statistically is still sand. If you take the trouble to
> read the study carefully, it turns out to be based on unsupported,
> ideology-driven premises and to raise what it would be most polite to
> describe as severe issues of data quality, however earnestly Groseclose and
> Milyo crunched their numbers.


The simple principle here is GIGO--no matter how nicely the numbers are
tabulated.

VS-)


On Wed, Aug 31, 2011 at 4:21 PM, Ben Zimmer
<bgzimmer at babel.ling.upenn.edu> wrote:

>
> Not to dampen your skepticism, Jon, but that's 10,000 *unique* words
> (types, not tokens). If you look at the study, you'll see they analyzed
> 9 billion words from Twitter, 360 billion words from Google Books,
> 1 billion words from The New York Times, and 59 million words from song
> lyrics. Presumably enough data to overcome stylistic biases in the source
> material.
>
> --bgz
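
For anyone unsure of the types/tokens distinction in the quoted message, here
is a minimal sketch in Python. The function name, the crude regex tokenizer,
and the sample sentence are all illustrative assumptions and have nothing to
do with how the study itself processed its corpora. It only shows how a text
with many running words (tokens) can contain far fewer distinct words (types).

```python
import re
from collections import Counter


def type_token_counts(text):
    """Return (number of tokens, number of types) for a piece of text.

    Tokenization here is a deliberately crude assumption: lowercase the
    text and keep runs of letters and apostrophes.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)  # one entry per distinct word (type)
    return len(tokens), len(counts)


# Hypothetical sample sentence, chosen only to show repetition.
sample = "the cat saw the dog and the dog saw the cat"
n_tokens, n_types = type_token_counts(sample)
print(n_tokens, "tokens,", n_types, "types")  # 11 tokens, 5 types
```

The gap between the two counts only widens as a corpus grows, since common
words keep recurring while new types turn up ever more slowly.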

------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org



More information about the Ads-l mailing list