[Corpora-List] 8 billion word English USENET corpus available for download. [Beta Version]

Cyrus Shaoul cyrus.shaoul at ualberta.ca
Thu Jan 25 22:29:03 UTC 2007


Fellow list members,

After getting some feedback from CORPORA-folk, I have been able to work
out a way to distribute a BETA VERSION of my USENET corpus over the
Internet to anyone who needs it. It is now available under a Creative
Commons license at:

    
http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html

(It is currently around 11 GB compressed, split into smaller files for
downloading.)

This corpus should be continuously available, so if there are any
researchers out there who have been looking for a freely available
corpus of USENET postings to collaborate on, enjoy.

The corpus contains a large selection of newsgroups, and a very low
percentage of non-English data. It covers the period from Oct 2005 to
last month. I will try to keep on adding new data to it every month,
so keep coming back if you would like updates.

Due to a network usage policy at my institution, I had to restrict the
download service to people who use computers that are on academic
networks. I wish I could remove this restriction, but unfortunately it
is a policy that I cannot do anything about, so if you are denied access
due to your network type, please don't ask me to make an exception.
I can't!

If you are looking for the orthographic frequencies for the most common
tokens in the corpus, they are still available (to all) at:

    
http://www.psych.ualberta.ca/~westburylab/downloads/wlallfreq.download.html

Yours,

Cyrus

-- 
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
Cyrus Shaoul
http://www.ualberta.ca/~cshaoul/
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}


