[Corpora-List] 155 *billion* (155,000,000,000) word corpus of American English

Mark Davies Mark_Davies at byu.edu
Thu May 12 20:34:14 UTC 2011


>> However, I was wondering whether it would be conceivable to make these data available as a set of torrent files distributed over the peer-to-peer BitTorrent network.

Again, the raw data is already available from Google Books (http://ngrams.googlelabs.com/datasets). I doubt that it would be legal to redistribute these, and I'm not sure what the advantage would be anyway. I'd go with downloads from the "official site".
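For reference, the released n-gram files are plain tab-separated text, one n-gram per line. Here is a minimal Python sketch for tallying counts from one unpacked file; the file name is a placeholder, and the five-column layout assumed here is that of the 2009 release (ngram, year, match count, page count, volume count):

from collections import defaultdict

# Sum match_count per n-gram across all years.
# Assumed columns (2009 release):
#   ngram TAB year TAB match_count TAB page_count TAB volume_count
counts = defaultdict(int)
with open("eng-us-2gram-part-000.tsv", encoding="utf-8") as f:  # placeholder name
    for line in f:
        ngram, year, match_count, _pages, _volumes = line.rstrip("\n").split("\t")
        counts[ngram] += int(match_count)

print(counts["corpus linguistics"])  # total occurrences across all years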

Mark D.

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: http://davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================
________________________________________
From: corpora-bounces at uib.no [corpora-bounces at uib.no] On Behalf Of DJamé Seddah [djame.seddah at free.fr]
Sent: Thursday, May 12, 2011 2:11 PM
To: corpora at uib.no
Subject: Re: [Corpora-List] 155 *billion* (155,000,000,000) word corpus of American English

On 12 May 2011, at 20:37, Mark Davies wrote:

> Angus,
>
>>> When I try to use it, I get "Session expired. Click here to start new session."
>
> Sorry, a little glitch for a few users. That's fixed now.
>
>>> Then you can appreciate how rough some of the OCR has been.
>
> Yes, I too had read about "issues" with the Google Book scans, and I was initially skeptical as well. Once I put everything into the database and started doing queries, however, I realized that the data is still very, very useful, and it provides great insight for many of the types of constructions and phenomena that I'm interested in. Try the "Five Minute Tour" at the site and decide for yourself what you think about the data in the n-grams.
>
> Michal Ptaszynski wrote:
>
>>> Luckily it's downloadable (the n-grams), since you do NOT want to use a hundred-billion-word corpus on an interface which blocks you for a day after 1000 queries. :D
>
> While *you personally* may not want to use a corpus with reasonable daily limits, about 100,000 other users per month *do* want to, and it's put a huge load on the one corpus server (http://corpus.byu.edu). Without some limits, no one would have any access, or everyone would have rotten, slow access. The other option is to charge for use, as is done for some other online corpora, in order to buy more servers. I've tried to keep things free, but it does entail some reasonable limits for users (1000 queries a day seems like it should work for most people). Sorry that's been so frustrating for you...
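A cap like that needs nothing more elaborate than a counter keyed on user and day. Here is a minimal sketch in Python, with an in-memory dict standing in for whatever storage the real server uses (the names and structure are illustrative, not the actual corpus.byu.edu implementation):

from datetime import date

DAILY_LIMIT = 1000   # queries per user per day, per the figure above
_usage = {}          # (user_id, day) -> queries used; in-memory, illustrative

def allow_query(user_id: str) -> bool:
    """Count the query and return True if the user is still under today's cap."""
    key = (user_id, date.today())
    if _usage.get(key, 0) >= DAILY_LIMIT:
        return False  # over quota: blocked until tomorrow
    _usage[key] = _usage.get(key, 0) + 1
    return True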


Hi,
I really appreciate the work you've put into this, and the web interface will certainly be useful for teaching and for preliminary investigation. However,
I was wondering whether it would be conceivable to make these data available as a set of torrent files distributed over the peer-to-peer BitTorrent network.
Your institution's bandwidth costs would be greatly reduced, and I'm sure many of us would be more than happy to seed these torrents for as long as possible.
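For a sense of what that would involve: a .torrent file is just a bencoded dictionary holding a tracker URL plus SHA-1 hashes of fixed-size pieces of the data, so seeders can verify what they serve. A minimal single-file sketch in Python follows; the tracker URL and file names are placeholders:

import hashlib
import os

def bencode(obj) -> bytes:
    """Minimal bencoder covering the types a .torrent metainfo needs."""
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, str):
        obj = obj.encode("utf-8")
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, dict):
        # keys are byte strings and must appear in sorted order
        items = sorted((k.encode("utf-8"), v) for k, v in obj.items())
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(x) for x in obj) + b"e"
    raise TypeError(type(obj))

def make_torrent(path: str, announce: str, piece_len: int = 2**20) -> bytes:
    """Single-file metainfo: concatenated SHA-1 digests of fixed-size pieces."""
    pieces = b""
    with open(path, "rb") as f:
        while chunk := f.read(piece_len):
            pieces += hashlib.sha1(chunk).digest()
    info = {"name": os.path.basename(path), "length": os.path.getsize(path),
            "piece length": piece_len, "pieces": pieces}
    return bencode({"announce": announce, "info": info})

# Placeholder names: a 155-billion-word data set would be split into many parts.
with open("ngrams-part-000.torrent", "wb") as out:
    out.write(make_torrent("ngrams-part-000.tsv", "http://tracker.example.org/announce"))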

Best,

Djamé



_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


