COHA/COCA meets Google Books

Neal Whitman nwhitman at AMERITECH.NET
Fri May 13 00:43:43 UTC 2011


For those who may not have gotten this message directly:

> From: Mark Davies <mark_davies at BYU.EDU>
> Date: May 12, 2011 7:20:05 PM EDT
> To: CORPORA at LISTSERV.BYU.EDU
> Subject: 155 *billion* (155,000,000,000) word corpus of American English
> Reply-To: "Users of corpus.byu.edu" <CORPORA at LISTSERV.BYU.EDU>
>
> This email is being sent to people who 1) have registered for the corpora
> at http://corpus.byu.edu, 2) have identified themselves as a "researcher",
> and 3) have used the corpora several times in the last few months.
> 
> --------------------------------
> 
> We’re pleased to announce a new corpus -- the Google Books (American 
> English) corpus: http://googlebooks.byu.edu/.
> 
> This corpus is based on the American English portion of the Google Books 
> data (see http://ngrams.googlelabs.com and especially 
> http://ngrams.googlelabs.com/datasets). It contains 155 *billion* words  
> (155,000,000,000) in more than 1.3 million books from the 1810s-2000s 
> (including 62 billion words from just 1980-2009).
> 
> The corpus has most of the functionality of the other corpora from 
> http://corpus.byu.edu (e.g. COCA, COHA, and our interface to the BNC), 
> including: searching by part of speech, wildcards, and lemma (and thus 
> advanced syntactic searches), synonyms, collocate searches, frequency by 
> decade (tables listing each individual string, or charts for total 
> frequency), comparisons of two historical periods (e.g. collocates 
> of "women" or "music" in the 1800s and the 1900s), and more.
> 
> This American English corpus is just one of seven Google Books-based 
> corpora that we hope to create in the next year or two (contingent on 
> funding, which we are applying for in June 2011). If funded, the other 
> corpora will include British English, English from the 1500s-1700s, and 
> corpora of Spanish, French, and German (see the listing at 
> http://ngrams.googlelabs.com/datasets).  Each of these corpora will be 
> based on at least 50 billion words of data, and they should represent a 
> nice addition to existing resources.
> 
> The Google Books (American English) corpus is freely available at
> http://googlebooks.byu.edu, and we hope that it is of value to you in your 
> research and teaching.
> 
> ============================================
> Mark Davies
> Professor of (Corpus) Linguistics
> Brigham Young University
> (phone) 801-422-9168 / (fax) 801-422-0906
> Web: http://davies-linguistics.byu.edu
> 
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
> ============================================

------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org


