[Corpora-List] Extracting text from Wikipedia articles
Cyrus Shaoul
cyrus.shaoul at ualberta.ca
Sat Aug 28 01:28:09 UTC 2010
Irina,
I am not sure if this helps you, but in April of this year I extracted the
text of the English version of Wikipedia using the WikiExtractor toolset
<http://medialab.di.unipi.it/wiki/Wikipedia_Extractor> and created a
990-million-word corpus that is freely available on my web site:
http://www.psych.ualberta.ca/~westburylab/downloads/westburylab.wikicorp.download.html
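For anyone who wants to check the size of a downloaded text corpus like this one, a minimal sketch follows. It assumes the corpus is a plain UTF-8 text file and that "words" are whitespace-separated tokens; the function names and the tokenization are assumptions for illustration, not a description of how the corpus was built or counted.

```python
from typing import Iterable


def count_words(lines: Iterable[str]) -> int:
    """Count whitespace-separated tokens across an iterable of text lines."""
    return sum(len(line.split()) for line in lines)


def count_words_in_file(path: str) -> int:
    """Stream a text file line by line so a very large corpus need not fit in memory.

    `path` is whatever local file name the download was saved under (hypothetical here).
    """
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_words(handle)
```

A different tokenizer (for example, one that splits off punctuation) would give a somewhat different total, so a count obtained this way may not match the figure quoted above exactly.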
Yours,
Cyrus
--
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
Cyrus Shaoul
http://www.psych.ualberta.ca/~westburylab/
University of Alberta
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora