[Corpora-List] Newly available: the 2010 portion of our USENET corpus (free for academic uses)

Cyrus Shaoul cyrus.shaoul at ualberta.ca
Thu May 26 22:10:14 UTC 2011


Dear fellow CORPORA list members,

Need something light to read this summer? How about 5,781,211,776
words of English text collected from a great variety of Internet
discussion boards last calendar year?

Slightly late, we bring you the Jan-Dec 2010 portion of the
Westbury Lab USENET corpus. As always, this corpus contains anonymized
postings collected from 47,860 USENET newsgroups.  This new portion is
5.7 GB in size, compressed, with the total corpus weighing in at 34 GB,
compressed (over 30 billion words in all!).

To download the 2010 data as well as the previous years' corpora,
please click below:

http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html

There are bandwidth limitations in place at my educational
institution, so to make the process more efficient, please try to use
the BitTorrent method to download the corpus if you are not on
Internet2 (instructions and links to BitTorrent clients are available
on the above web page).

I am assigning the majority of my seeding machine bandwidth
exclusively to the 2010 data for the next week, so please try to get
your torrent download started between today and June 6th. As always,
please leave your BitTorrent software open for a few days after
completing the download to help others get the parts of the file that
they need!

If you have problems with the download process, please contact me
directly rather than posting to this mailing list.

Thanks,

Cyrus

PS: Please make sure to cite the corpus if you use it, and read the
Creative Commons licence if you have questions about the restrictions
we put on the usage of the corpus.

-- 
http://www.ualberta.ca/~cshaoul/



