[Corpora-List] 2009 data released today for the Westbury Lab USENET corpus.

Cyrus Shaoul cyrus.shaoul at ualberta.ca
Wed Jan 13 00:12:22 UTC 2010


Corpora-list members,

As we begin a new year, it is a pleasure to look back on the last one
and enjoy some of the fruits of our labors. In this spirit, I am happy
to announce the release of the latest dataset in our USENET corpus: the
new postings from 2009. This new archive contains 6.2 GB of data
(compressed) and many billions of words of English text. As always, it
is released under a Creative Commons license and is free for
non-commercial use. Please cite us if you use it for any academic work.

The primary method of obtaining the corpus is over BitTorrent, but for
this to work well, many people should start downloading the corpus
simultaneously. (Unlike with standard HTTP transfers, the more people
who download at the same time, the BETTER!) For that reason, I would
encourage everyone who has been waiting for the 2009 data to begin
downloading the corpus this week (Jan 12th - Jan 19th), and I hope that
the performance you get is improved through the magic of BitTorrent.

The corpus is available here:

     http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html

For our other software, including our HAL implementation, known as 
HiDEx, please see:

     http://www.psych.ualberta.ca/~westburylab/publications.html

Yours,

Cyrus


=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
Cyrus Shaoul
http://www.ualberta.ca/~cshaoul/
University of Alberta
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}



_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


