Google and "culturomics"

Paul B. Gallagher paulbg at PBG-TRANSLATIONS.COM
Wed Dec 22 09:08:52 UTC 2010


Several news media published stories on Friday about a new Google tool 
<http://ngrams.googlelabs.com/> that allows the user to graph the 
frequency of words and phrases (up to five words long) in a huge corpus 
over time. See, for example:

<http://www.guardian.co.uk/science/2010/dec/16/culturomics-google-tool-cultural-trends>
<http://www.nytimes.com/2010/12/17/books/17words.html>

There's also a Language Log thread:
<http://languagelog.ldc.upenn.edu/nll/?p=2848>

And a Science article:
<http://www.sciencemag.org/content/early/2010/12/15/science.1199644>
Quantitative Analysis of Culture Using Millions of Digitized Books
Jean-Baptiste Michel et al.

Abstract: We constructed a corpus of digitized texts containing about 4% 
of all books ever printed. Analysis of this corpus enables us to 
investigate cultural trends quantitatively. We survey the vast terrain 
of "culturomics", focusing on linguistic and cultural phenomena that 
were reflected in the English language between 1800 and 2000. We show 
how this approach can provide insights about fields as diverse as 
lexicography, the evolution of grammar, collective memory, the adoption 
of technology, the pursuit of fame, censorship, and historical 
epidemiology. "Culturomics" extends the boundaries of rigorous 
quantitative inquiry to a wide array of new phenomena spanning the 
social sciences and the humanities.

Full text: <http://www.sciencemag.org/content/330/6011/1600.full.pdf>


Fair warning: this thing is addictive.

I asked the tool to plot "data is" vs. "data are" and found that the 
plural usage peaked about 1983 and has tailed off since, while the 
singular peaked about 1990 and has leveled off. Surprisingly, the 
singular usage is still about a third less common than the plural. A 
similar pattern can be seen for "media" -- the singular usage is 
growing, but has not caught up to the plural.
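
For anyone curious what a comparison like this involves, here is a 
minimal sketch in Python. It is not Google's implementation. The helper 
names (ngram_counts, relative_frequency), the toy corpus, and the choice 
to divide by the total number of same-length n-grams in each year are 
all assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count every run of n consecutive words (lowercased, whitespace-split)."""
    words = text.lower().split()
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


def relative_frequency(corpus_by_year, phrase):
    """Map each year to (occurrences of phrase) / (all n-grams of that length)."""
    n = len(phrase.split())
    result = {}
    for year, text in sorted(corpus_by_year.items()):
        counts = ngram_counts(text, n)
        total = sum(counts.values())
        # Guard against a year whose text is shorter than the phrase.
        result[year] = counts[phrase.lower()] / total if total else 0.0
    return result


# Invented toy data, NOT taken from the Google corpus.
toy_corpus = {
    1980: "the data are clear and the data are consistent",
    1990: "the data is clear but the data are noisy",
}

for phrase in ("data is", "data are"):
    print(phrase, relative_frequency(toy_corpus, phrase))
```

Plotting the two resulting series against the year would give a crude 
version of the kind of graph described above.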

I also tried the Russian corpus, and learned that "на Украине" has 
risen and fallen as Ukraine became a more or less prominent topic of 
conversation, but "в Украине" clung to the floor until about 1990, when 
it suddenly took off, nearly catching its traditional counterpart in 
1999 before falling back to about half the latter's frequency.

-- 
War doesn't determine who's right, just who's left.
--
Paul B. Gallagher
pbg translations, inc.
"Russian Translations That Read Like Originals"
http://pbg-translations.com



