[Corpora-List] Westbury Lab English Wikipedia corpus now available. (April 2010 version)
Cyrus Shaoul
cyrus.shaoul at ualberta.ca
Thu May 20 21:57:49 UTC 2010
Dear Fellow Corpora List members:
Following the style of our USENET corpus, we have just released the first
version of a corpus extracted from the English Wikipedia. It was created
from a snapshot taken in April 2010 and is freely available now at the
following URL:
http://www.psych.ualberta.ca/~westburylab/downloads/westburylab.wikicorp.download.html
There is a complete description of the corpus on the above web page, but
here are a few quick points:
1) Size: over 900 million words in over 2.8 million documents.
2) Clean text, unprocessed and untagged.
3) Distributed as a single compressed file (1.8 GB) with document
delimiters.
4) CC license. Please read the licensing terms for both this corpus and
Wikipedia carefully.
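For anyone planning to process the single-file distribution, here is a
minimal Python sketch of streaming it one document at a time. It is only
an illustration: the delimiter string below is an assumption (this message
does not state the actual marker, so please check the corpus description
page), and the function names iter_documents and count_words are made up
for this example.

```python
from typing import Iterable, Iterator, List, Tuple

# ASSUMPTION: placeholder delimiter. The announcement does not give the
# real marker; substitute the one documented on the corpus web page.
DELIMITER = "---END.OF.DOCUMENT---"


def iter_documents(lines: Iterable[str],
                   delimiter: str = DELIMITER) -> Iterator[str]:
    """Yield documents from a stream of lines separated by delimiter lines."""
    buffer: List[str] = []
    for line in lines:
        if line.strip() == delimiter:
            text = "".join(buffer).strip()
            if text:  # skip empty documents
                yield text
            buffer = []
        else:
            buffer.append(line)
    # Emit trailing text that is not followed by a delimiter.
    text = "".join(buffer).strip()
    if text:
        yield text


def count_words(lines: Iterable[str],
                delimiter: str = DELIMITER) -> Tuple[int, int]:
    """Return (documents, whitespace-separated words) without loading the
    whole corpus into memory."""
    docs = words = 0
    for doc in iter_documents(lines, delimiter):
        docs += 1
        words += len(doc.split())
    return docs, words
```

Because the file is large, the sketch takes any iterable of lines rather
than a string, so it can be fed a text-mode file handle directly. If the
archive turns out to be bzip2-compressed (also an assumption), something
like bz2.open(path, "rt", encoding="utf-8") from the standard library
would provide such a handle.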
As always, it is available as a direct download to those on Internet2.
For normal Internet connections, we offer a BitTorrent download. If you
use the BitTorrent download, please help us synchronize the swarm by
starting your download today and leaving your BitTorrent program running
for a few days after the download completes. This will help others
download the file, and help you create some good karma for yourself.
Your feedback is welcome and appreciated,
Cyrus
--
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
Cyrus Shaoul
http://www.psych.ualberta.ca/~westburylab/
University of Alberta
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora