[Corpora-List] Extracting text from Wikipedia articles
Goran Rakic
grakic at devbase.net
Fri Aug 27 18:16:47 UTC 2010
Dear Irina,
Some time ago I used a Python script by Antonio Fuschetto. The script
works on a Wikipedia database dump (XML file) from
http://download.wikimedia.org. It processes individual articles, strips
all wiki markup and produces plain text output.
According to Google, the script was available from
http://medialab.di.unipi.it/wiki/Wikipedia_Extractor, but that site
currently seems to be down. You can download a slightly modified version
from http://alas.matf.bg.ac.rs/~mr04069/WikiExtractor.py
To run the script against the downloaded database dump, pass the dump as
standard input using shell redirection. Change the process_page() method
to fit your needs.
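
As a rough illustration of the idea (this is not the code of
WikiExtractor.py, whose internals may well differ), the sketch below
reads a MediaWiki XML dump from a stream and hands each article to a hook.
The names iter_pages() and main(), and the process_page(title, text)
signature used here, are assumptions made for the example. The sketch
does not strip wiki markup; it only shows the stdin-plus-hook pattern.

```python
import sys
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML dump stream."""
    title, text = "", ""
    # iterparse reads the file incrementally instead of loading it whole.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = "", ""
            elem.clear()  # drop the finished page's children to save memory


def process_page(title, text):
    """Hypothetical hook: change this to fit your needs.

    Here it only reports the title and the length of the raw wikitext.
    """
    return "%s\t%d" % (title, len(text))


def main(stream=None):
    """Process a dump from the given binary stream (default: stdin)."""
    if stream is None:
        stream = sys.stdin.buffer
    for title, text in iter_pages(stream):
        print(process_page(title, text))
```

Saved under a name such as pages.py (hypothetical) with a call to main()
added at the bottom, it would be run as: python pages.py < dump.xml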
Kind regards,
Goran Rakic
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora