[Corpora-List] Extracting text from Wikipedia articles

Goran Rakic grakic at devbase.net
Fri Aug 27 18:16:47 UTC 2010


Dear Irina,

Some time ago I used a Python script by Antonio Fuschetto. The script
works on a Wikipedia database dump (an XML file) from
http://download.wikimedia.org. It processes the articles one by one,
strips all wiki markup and produces plain-text output.
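
As a rough illustration of the first half of that job (reading the dump
article by article), here is a minimal sketch. It is not the code of
Antonio's script. The names iter_pages and local_name are made up for
this example, and it assumes the usual dump layout in which each <page>
element holds a <title> and a <text> element.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that export files put on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) for each <page> element read from stream.

    Illustrative helper only; the real script may parse the dump differently.
    """
    title, text = None, ""
    # iterparse reads incrementally, so the whole dump is never loaded at once.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the finished page's children to keep memory use low
```

Calling iter_pages(sys.stdin.buffer) (after import sys) would read a dump
that was passed on standard input.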

According to Google, the script used to be available from
http://medialab.di.unipi.it/wiki/Wikipedia_Extractor, but that site
currently seems to be down. You can download a slightly modified version
from http://alas.matf.bg.ac.rs/~mr04069/WikiExtractor.py

To run the script against the downloaded database dump, pass the dump to
it on standard input using shell redirection. Change the process_page()
method to fit your needs.
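
With shell redirection, the call would look something like
"python WikiExtractor.py < dump.xml", where dump.xml is only a
placeholder name for the downloaded XML dump. To give an idea of what a
customised process_page() could do, here is a small sketch. The
signature (title, wikitext) and the helper strip_markup are assumptions
made for this example, so check the method in the script itself before
adapting it. The regular expressions cover only a few common markup
constructs and are far cruder than a real extractor.

```python
import re


def strip_markup(wikitext):
    """Very rough wiki-markup removal (illustrative, not exhaustive)."""
    text = wikitext
    previous = None
    while previous != text:  # templates nest, so peel the innermost layer each pass
        previous = text
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    text = re.sub(r"<ref[^>]*/>", "", text)  # self-closing references
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    text = re.sub(r"<[^>]+>", "", text)  # any remaining HTML-like tags
    # [[target|label]] becomes label, [[target]] becomes target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)  # bold and italic quote runs
    return text.strip()


def process_page(title, wikitext):
    """Hypothetical hook: return a plain-text record, or None to skip the page."""
    if wikitext.lstrip().upper().startswith("#REDIRECT"):
        return None  # redirect pages carry no article text
    return "%s\n%s\n" % (title, strip_markup(wikitext))
```

Returning None for redirects is just one possible choice here; for a
corpus you may also want to skip very short pages or pages outside the
main article namespace.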

Kind regards,
Goran Rakic



_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


