[Corpora-List] 155 billion (155,000,000,000) word corpus of American English

Thu May 12 18:37:07 UTC 2011

Angus,

>>  When I try to use it, I get "Session expired. Click here to start new session."

Sorry, a little glitch for a few users. That's fixed now.

>> Then you can appreciate how rough some of the OCR has been.

Yes, I too had read about "issues" with the Google Books scans, and I was initially skeptical as well. Once I put everything into the database and started doing queries, however, I realized that the data is still very, very useful, and it provides great insight into many of the types of constructions and phenomena that I'm interested in. Try the "Five Minute Tour" at the site and decide for yourself what you think about the data in the n-grams.

Michal Ptaszynski wrote:

>> Luckily it's downloadable (the n-grams), since you do NOT want to use a hundred-billion-word corpus on an interface that blocks you for a day after 1000 queries. :D

While *you personally* may not want to use a corpus with reasonable daily limits, about 100,000 other users per month *do* want to, and that has put a huge load on the single corpus server (http://corpus.byu.edu). Without some limits, either no one would have any access or everyone would have rotten, slow access. The other option is to charge for use, as is done for some other online corpora, in order to buy more servers. I've tried to keep things free, but that does entail some reasonable limits for users (1000 queries a day seems like it should work for most people). Sorry that's been so frustrating for you...

Mark D.

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: http://davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================

From: corpora-bounces at uib.no [corpora-bounces at uib.no] On Behalf Of Angus Grieve-Smith [grvsmth at panix.com]
Sent: Thursday, May 12, 2011 10:37 AM
To: corpora at uib.no
Subject: Re: [Corpora-List] 155 *billion* (155,000,000,000) word corpus of American English

On 5/12/2011 11:15 AM, Mark Davies wrote:
>> Is the corpus itself or part of it available for downloading? It would be more useful if we could process the raw text for our own purpose rather than accessing it from a web interface.
>
> As mentioned previously, the underlying n-grams data is freely available from Google at http://ngrams.googlelabs.com/datasets (see http://creativecommons.org/licenses/by/3.0/ re. licensing).

    When I try to use it, I get "Session expired. Click here to start new session."

    In theory, though, all the books are available for free from http://books.google.com/ .  In the Google ngram interface at http://ngrams.googlelabs.com/ there are links to date ranges.  If you click on those you will see a date range result for the search term on the Google Books website.  You can then click the "Plain text" link in the upper right hand corner to see the OCRed text.  Then you can appreciate how rough some of the OCR has been.

-- 
    -Angus B. Grieve-Smith
    grvsmth at panix.com