[Corpora-List] 8 billion word English USENET corpus available for download. [Beta Version]
Cyrus Shaoul
cyrus.shaoul at ualberta.ca
Thu Jan 25 22:29:03 UTC 2007
Fellow list members,
After getting some feedback from CORPORA-folk, I have been able to work
out a way to distribute a BETA VERSION of my USENET corpus over the
Internet to anyone who needs it. It is now available under a Creative
Commons license at:
http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html
(It is currently around 11 GB in size (compressed), split into smaller
files for downloading.)
This corpus should be continuously available, so if there are any
researchers out there who have been looking for a freely available
corpus of USENET postings to collaborate on, enjoy.
The corpus contains a large selection of newsgroups, and a very low
percentage of non-English data. It covers the period from Oct 2005 to
last month. I will try to keep adding new data to it every month, so
keep coming back if you would like updates.
Due to a network usage policy at my institution, I had to restrict the
download service to people who use computers that are on academic
networks. I wish I could remove this restriction, but unfortunately it
is a policy that I cannot do anything about, so if you are denied access
due to your network type, please don't ask me to make an exception.
I can't!
If you are looking for the orthographic frequencies for the most common
tokens in the corpus, they are still available (to all) at:
http://www.psych.ualberta.ca/~westburylab/downloads/wlallfreq.download.html
Yours,
Cyrus
--
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
Cyrus Shaoul
http://www.ualberta.ca/~cshaoul/
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}