[Corpora-List] 2009 data released today for the Westbury Lab USENET corpus.
Cyrus Shaoul
cyrus.shaoul at ualberta.ca
Wed Jan 13 00:12:22 UTC 2010
Corpora-list members,
As we begin a new year, it is a pleasure to look back upon the last year
and enjoy some of the fruits of our labors. In this
spirit, I am happy to announce the release of the latest dataset in our
USENET corpus: the new postings from 2009. This new
archive contains 6.2 GB of data (compressed) and many billions of words
of English text. As always, it is released
under a Creative Commons license and is free for non-commercial use.
Please cite us if you do use it for any
academic work.
The primary method of obtaining the corpus is over BitTorrent, but for
this to work well, many people
should start downloading the corpus simultaneously.
(Unlike with standard HTTP transfers, the more people who download at
the same time, the BETTER!)
For that reason, I would encourage everyone who has been waiting for the
2009 data to begin downloading the corpus this week (Jan 12th - Jan 19th),
and I hope that the performance that you get is improved through the
magic of BitTorrent.
The corpus is available here:
http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html
For our other software, including our HAL implementation, known as
HiDEx, please see:
http://www.psych.ualberta.ca/~westburylab/publications.html
Yours,
Cyrus
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
Cyrus Shaoul
http://www.ualberta.ca/~cshaoul/
University of Alberta
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora