[Corpora-List] Our large USENET corpus is now available on Amazon Web Services (AWS).

Cyrus Shaoul cyrus.shaoul at ualberta.ca
Wed Nov 17 22:39:57 UTC 2010


Dear Corpora-list members,

After receiving many requests over the years for a better way to obtain 
our 28 billion word USENET corpus, I have recently submitted the corpus 
to Amazon Web Services, and they have graciously made it one of their 
public data sets. The data are being hosted as a public service by AWS. 
To use the data, just set up an account at AWS and then mount the snapshot 
listed below:

http://aws.amazon.com/datasets/1679761938200766

Please let me know if this resolves the issues you have had with 
downloading the corpus. In theory, mounting and copying this dataset 
should now take minutes instead of days.

NOTE: Make sure to read the license before using our corpus, as we place 
restrictions on its usage. It is free to use for all academic and 
non-profit projects, but please cite the corpus when you report your 
results!

The corpus continues to be available over BitTorrent and HTTP here:

http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html

Finally, the USENET data for Jan-Dec 2010 should be available in January 
of 2011 if all goes well.

Yours,

Cyrus


=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
Cyrus Shaoul
http://www.psych.ualberta.ca/~westburylab/
University of Alberta
780-492-5843
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}



_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


