22.2067, FYI: 155 Billion Word Corpus: American English

Fri May 13 19:10:45 UTC 2011

LINGUIST List: Vol-22-2067. Fri May 13 2011. ISSN: 1068 - 4875.

Subject: 22.2067, FYI: 155 Billion Word Corpus: American English

Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews: Veronika Drake, U of Wisconsin-Madison
         Monica Macaulay, U of Wisconsin-Madison
         Rajiv Rao, U of Wisconsin-Madison
         Joseph Salmons, U of Wisconsin-Madison
         Anja Wanner, U of Wisconsin-Madison
         <reviews at linguistlist.org>

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, 
and donations from subscribers and publishers.

Editor for this issue: Danielle St. Jean <danielle at linguistlist.org>
================================================================  

To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.cfm.

===========================Directory==============================  

1)
Date: 12-May-2011
From: Mark Davies [mark_davies at byu.edu]
Subject: 155 Billion Word Corpus: American English

-------------------------Message 1 ---------------------------------- 
Date: Fri, 13 May 2011 15:08:41
From: Mark Davies [mark_davies at byu.edu]
Subject: 155 Billion Word Corpus: American English

We're pleased to announce a new corpus -- the Google Books 
(American English) corpus: http://googlebooks.byu.edu/

This corpus is based on the American English portion of the Google 
Books data (see http://ngrams.googlelabs.com and especially 
http://ngrams.googlelabs.com/datasets). It contains 155 *billion* words
(155,000,000,000) in more than 1.3 million books from the 1810s to the
2000s (including 62 billion words from 1980-2009 alone).

The corpus has most of the functionality of the other corpora from 
http://corpus.byu.edu (e.g. COCA, COHA, and our interface to the 
BNC), including: searching by part of speech, wildcards, and lemma 
(and thus advanced syntactic searches), synonyms, collocate 
searches, frequency by decade (tables listing each individual string, or 
charts for total frequency), comparisons of two historical periods (e.g. 
collocates of "women" or "music" in the 1800s and the 1900s), and 
more.
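
As a rough illustration of what "frequency by decade" means, here is a 
minimal Python sketch that sums yearly counts for one word into decades. 
The tab-separated layout (word, year, count), the sample numbers, and the 
function name frequency_by_decade are all assumptions made up for this 
example; they are not the corpus's actual data format or interface.

```python
from collections import defaultdict

# Hypothetical sample rows: word, year, match count (tab-separated).
# Layout and numbers are invented for illustration only.
SAMPLE = """\
music\t1895\t120
music\t1899\t80
music\t1901\t200
music\t1987\t950
women\t1987\t400
"""

def frequency_by_decade(lines, target):
    """Sum the counts for `target` per decade (1890 stands for 1890-1899)."""
    totals = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, year, count = line.split("\t")[:3]
        if word == target:
            decade = (int(year) // 10) * 10
            totals[decade] += int(count)
    # Return a plain dict ordered by decade
    return dict(sorted(totals.items()))

print(frequency_by_decade(SAMPLE.splitlines(), "music"))
```

A real comparison across periods would also need to normalize by the 
total number of words in each decade, which this sketch leaves out.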

This American English corpus is just one of seven Google Books-based 
corpora that we hope to create in the next year or two (contingent on 
funding, which we are applying for in June 2011). If funded, the other 
corpora will include British English, English from the 1500s-1700s, and 
corpora of Spanish, French, and German (see the listing at 
http://ngrams.googlelabs.com/datasets). Each of these corpora will be 
based on at least 50 billion words of data, and they should represent a 
nice addition to existing resources.

The Google Books (American English) corpus is freely available at 
http://googlebooks.byu.edu, and we hope that it is of value to you in 
your research and teaching. 

Linguistic Field(s): Computational Linguistics
                     Text/Corpus Linguistics

Subject Language(s): English (eng)
