[Corpora-List] Westbury Lab English Wikipedia corpus now available. (April 2010 version)

Cyrus Shaoul cyrus.shaoul at ualberta.ca
Thu May 20 21:57:49 UTC 2010


Dear Fellow Corpora List members:

In a similar style to our USENET corpus, we have just released the first 
version of a corpus extracted from the English Wikipedia. It was created 
from a snapshot taken in April 2010 and is freely available immediately 
at the following URL:

http://www.psych.ualberta.ca/~westburylab/downloads/westburylab.wikicorp.download.html

There is a complete description of the corpus on the above web page, but 
here are a few quick points:

1) Size: over 900 million words in over 2.8 million documents.
2) Clean text, unprocessed and untagged.
3) Distributed as a single file (1.8 GB compressed) with document 
delimiters.
4) CC license. Please read the licensing for this corpus and for 
Wikipedia carefully.
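
For anyone wondering how to work with a single delimited file, here is a 
minimal Python sketch that streams it one document at a time instead of 
loading everything into memory. It assumes the delimiter sits on a line 
of its own. The delimiter string and the function name iter_documents 
are assumptions made for illustration, so please check the corpus 
description page for the exact marker before relying on it.

```python
from typing import Iterable, Iterator

# Assumption: placeholder delimiter for illustration. Confirm the real
# marker on the corpus description page.
DELIMITER = "---END.OF.DOCUMENT---"


def iter_documents(lines: Iterable[str],
                   delimiter: str = DELIMITER) -> Iterator[str]:
    """Yield one document at a time from a stream of text lines."""
    buffer = []
    for line in lines:
        if line.strip() == delimiter:
            # A delimiter line closes the current document.
            text = "".join(buffer).strip()
            if text:
                yield text
            buffer = []
        else:
            buffer.append(line)
    # Emit any trailing text that is not followed by a delimiter.
    text = "".join(buffer).strip()
    if text:
        yield text


# Small in-memory demonstration; a real run would pass an open file
# object (decompressed corpus file) instead of this list.
sample = [
    "First doc.\n",
    "More text.\n",
    "---END.OF.DOCUMENT---\n",
    "Second doc.\n",
]
docs = list(iter_documents(sample))
```

Because the function accepts any iterable of lines, an open file handle 
can be passed directly, which keeps memory use low even for a corpus of 
this size.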

As always, it is available as a direct download to those on Internet2. 
For normal Internet connections, we offer a BitTorrent download. If you 
use the BitTorrent download, please help us synchronize the swarm by 
starting your download today and leaving your BitTorrent program running 
for a few days after the download completes. This will help others 
download the file and create some good karma for yourself.

Your feedback is welcome and appreciated,

Cyrus

-- 
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
Cyrus Shaoul
http://www.psych.ualberta.ca/~westburylab/
University of Alberta
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}





