[Corpora-List] Newly available: the 2010 portion of our USENET corpus (free for academic uses)

Kim Witten kimwitten at gmail.com
Fri May 27 11:47:05 UTC 2011


Dear Cyrus,
This is wonderful! Will you be computing the orthographic frequencies for the 2010 corpus data as well? That would be especially relevant to my research, and it would be great to compare the 2010 frequencies for particular words with the 2005-2006 USENET frequencies for the same words.
Thanks!
-Kim
---
Kim Witten, PhD candidate 
Language & Linguistic Science 
the University of York, UK
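
A minimal sketch of the kind of frequency comparison mentioned above, assuming each corpus portion is available as plain text. The function names, the simple regex tokenizer, and the smoothing floor are illustrative assumptions, not the Westbury Lab's method or file format.

```python
from collections import Counter
import math
import re


def per_million(text):
    """Return (word -> occurrences per million tokens, total token count).

    Tokenization is a naive lowercase regex split (an assumption made for
    illustration only).
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {}, 0
    counts = Counter(tokens)
    return {w: c * 1_000_000 / total for w, c in counts.items()}, total


def log_ratio(word, freq_old, freq_new, floor=0.5):
    """Log2 ratio of a word's per-million frequency, new versus old.

    `floor` is a hypothetical smoothing constant that keeps words missing
    from one corpus from causing a division by zero.
    """
    old = freq_old.get(word, 0.0)
    new = freq_new.get(word, 0.0)
    return math.log2((new + floor) / (old + floor))
```

A positive value from `log_ratio` means the word is relatively more frequent in the newer text, and a negative value means it is relatively less frequent. Normalizing to occurrences per million tokens keeps the comparison meaningful when the two portions differ in size.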

> From: Cyrus Shaoul <cyrus.shaoul at ualberta.ca>
> Date: May 26, 2011 11:10:14 PM GMT+01:00
> To: corpora at hd.uib.no
> Subject: [Corpora-List] Newly available: the 2010 portion of our USENET corpus (free for academic uses)
> 
> Dear fellow CORPORA list members,
> 
> Need something light to read this summer? How about 5,781,211,776
> words of English text collected from a great variety of Internet
> discussion boards last calendar year?
> 
> Slightly late, we bring you the Jan-Dec 2010 portion of the
> Westbury Lab USENET corpus. As always, this corpus contains anonymized
> postings collected from 47,860 USENET newsgroups.  This new portion is
> 5.7 GB in size, compressed, with the total corpus weighing in at 34 GB,
> compressed (over 30 billion words in all!).
> 
> To download the 2010 data as well as the previous years' corpora,
> please click below:
> 
> http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html
> 
> There are bandwidth limitations in place at my educational
> institution, so to make the process more efficient, please try to use
> the BitTorrent method to download the corpus if you are not on
> Internet2 (instructions and links to BitTorrent clients are available
> on the above web page).
> 
> I am assigning the majority of my seeding machine bandwidth
> exclusively to the 2010 data for the next week, so please try to get
> your torrent download started between today and June 6th. As always,
> please leave your BitTorrent software open for a few days after
> completing the download to help others get the parts of the file that
> they need!
> 
> If you have problems with the download process, please contact me
> directly, rather than this mailing list.
> 
> Thanks,
> 
> Cyrus
> 
> PS: Please make sure to cite the corpus if you use it, and read the
> Creative Commons licence if you have questions about the restrictions
> we put on the usage of the corpus.
> 
> -- 
> http://www.ualberta.ca/~cshaoul/