[Corpora-List] 155 *billion* (155,000,000,000) word corpus of American English

Lushan Han lushan1 at umbc.edu
Thu May 12 15:04:20 UTC 2011


Hi Mark,

Is the corpus itself, or part of it, available for download? It would be
more useful if we could process the raw text for our own purposes rather than
accessing it through a web interface.

Best regards,
Lushan Han

On Thu, May 12, 2011 at 10:52 AM, Mark Davies <Mark_Davies at byu.edu> wrote:

> We’re pleased to announce a new corpus -- the Google Books (American
> English) corpus (http://googlebooks.byu.edu/).
>
> This corpus is based on the American English portion of the Google Books
> data (see http://ngrams.googlelabs.com and especially
> http://ngrams.googlelabs.com/datasets). It contains 155 *billion* words
> (155,000,000,000) in more than 1.3 million books from the 1810s-2000s
> (including 62 billion words from just 1980-2009).
>
> The corpus has most of the functionality of the other corpora from
> http://corpus.byu.edu (e.g. COCA, COHA, and our interface to the BNC),
> including: searching by part of speech, wildcards, and lemma (and thus
> advanced syntactic searches), synonyms, collocate searches, frequency by
> decade (tables listing each individual string, or charts for total
> frequency), comparisons of two historical periods (e.g. collocates of
> "women" or "music" in the 1800s and the 1900s), and more.
>
> This American English corpus is just one of seven Google Books-based
> corpora that we hope to create in the next year or two (contingent on
> funding, which we are applying for in June 2011). If funded, the other
> corpora will include British English, English from the 1500s-1700s, and
> corpora of Spanish, French, and German (see the listing at
> http://ngrams.googlelabs.com/datasets).  Each of these corpora will be
> based on at least 50 billion words of data, and they should represent a nice
> addition to existing resources.
>
> The Google Books (American English) corpus is freely available at
> http://googlebooks.byu.edu, and we hope that it is of value to you in your
> research and teaching.
>
> ============================================
> Mark Davies
> Professor of (Corpus) Linguistics
> Brigham Young University
> (phone) 801-422-9168 / (fax) 801-422-0906
> Web: http://davies-linguistics.byu.edu
>
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
> ============================================
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>