[Corpora-List] Extracting text from Wikipedia articles

Cyrus Shaoul cyrus.shaoul at ualberta.ca
Sat Aug 28 01:28:09 UTC 2010


Irina,

I am not sure whether this helps, but in April of this year I extracted the
text of the English version of Wikipedia using the WikiExtractor
<http://medialab.di.unipi.it/wiki/Wikipedia_Extractor> toolset and
created a 990-million-word corpus that is freely available on my web site:

http://www.psych.ualberta.ca/~westburylab/downloads/westburylab.wikicorp.download.html

Yours,

Cyrus

-- 
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
Cyrus Shaoul
http://www.psych.ualberta.ca/~westburylab/
University of Alberta
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
