Dear CORPORA Mailing list Members,<br><br>I would like to thank everybody who replied to my question, and to post a summary of the responses I received.<br><br>Best regards,<br><br>Irina Temnikova<br>

<br>

PhD Student in Computational Linguistics<br>

Editorial Assistant of the Journal of Natural Language Engineering<br>

<br>

Research Group in Computational Linguistics<br>

Research Institute of Information and Language Processing<br>

University of Wolverhampton, UK<br clear="all"><br><br>=============<br>Question:<br><br>Dear CORPORA mailing list members,<br><br>Do any of you know of any tool for extracting text specifically from<br>Wikipedia articles, besides those for extracting text from HTML pages?<br>


<br>I only need the title and the text, without any of the formal elements<br>present in every Wikipedia article (such as "From Wikipedia, the free<br>encyclopedia", "This article is about ..", [edit], the list of<br>


languages, "Main article:", "Categories:") and without "Contents", "See also",<br>"References", "Notes" and "External links".<br><br>Can you give me any suggestions?<br>


<br>=============<br><br>Answers:<br><br>-------------<br>Roman Klinger wrote:<br><br>Users can add arbitrary HTML code. If you want to interpret that (to get<br>the plain text) you could use the text-based web browser lynx, which can<br>


dump to a text file. That works quite well, but it is an HTML extraction<br>method, which you excluded.<br><br>Another approach, which a colleague pointed me to and told me works -- I did<br>not try it myself -- is described here:<br>


<a href="http://evanjones.ca/software/wikipedia2text.html">http://evanjones.ca/software/wikipedia2text.html</a><br><br>-------------<br>Goran Rakic wrote:<br><br>Some time ago I used a Python script by Antonio Fuschetto. This<br>


script can work on a Wikipedia database dump (XML file) from<br><a href="http://download.wikimedia.org">http://download.wikimedia.org</a> and knows how to process individual<br>articles, strip all Wiki tags and provide a plain text output.<br>


<br>Google shows me that the script was available from<br><a href="http://medialab.di.unipi.it/wiki/Wikipedia_Extractor">http://medialab.di.unipi.it/wiki/Wikipedia_Extractor</a> but this site<br>currently seems to be down. You can download a slightly modified version<br>


from <a href="http://alas.matf.bg.ac.rs/~mr04069/WikiExtractor.py">http://alas.matf.bg.ac.rs/~mr04069/WikiExtractor.py</a><br><br>To run the script against the downloaded database dump, pass the dump as<br>standard input using shell redirection. Change the process_page() method<br>


to fit your needs.<br><br>-------------<br>Srinivas Gokavarapu wrote:<br><br>This is a tool for extracting information from Wikipedia:<br><a href="http://wikipedia-miner.sourceforge.net/">http://wikipedia-miner.sourceforge.net/</a> Have a look at it.<br>


<br>-------------<br>Nitin Madnani wrote:<br><br>I recently did this. I downloaded the Freebase Wikipedia extraction (google that) and used BeautifulSoup to extract just the text part. It was a couple of days' work at the most.<br>
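<br>A minimal sketch of the kind of HTML-based extraction described in this answer. It uses Python's standard-library html.parser as a stand-in for BeautifulSoup so that it runs without extra packages. The id and class names in SKIP_IDS and SKIP_CLASSES are assumptions about how the formal page elements are marked up, and the parser assumes well-formed HTML in which every non-void tag is closed.<br>

```python
from html.parser import HTMLParser

# Assumed names of the "formal" page elements to drop; the markup of a
# real Wikipedia page may use different ids and classes.
SKIP_IDS = {"toc", "siteSub", "catlinks"}
SKIP_CLASSES = {"mw-editsection", "hatnote", "reference"}
# Tags that never have a closing tag, so they must not affect depth counts.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class ArticleText(HTMLParser):
    """Collect text inside <h1> and <p>, ignoring unwanted subtrees."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped subtree
        self.keep_depth = 0  # > 0 while inside <h1> or <p>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr = dict(attrs)
        classes = set((attr.get("class") or "").split())
        if self.skip_depth or attr.get("id") in SKIP_IDS or classes & SKIP_CLASSES:
            self.skip_depth += 1
        elif tag in ("h1", "p"):
            self.keep_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        elif tag in ("h1", "p") and self.keep_depth:
            self.keep_depth -= 1
            self.chunks.append("\n")  # one output line per heading/paragraph

    def handle_data(self, data):
        if self.keep_depth and not self.skip_depth:
            self.chunks.append(data)


def extract_text(html):
    """Return the title and paragraph text of an article page, one per line."""
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.chunks).split("\n"))
    return "\n".join(line for line in lines if line)
```

Calling extract_text() on the HTML of a saved article page returns the heading and paragraphs as plain text; the skip lists would need adjusting after inspecting the actual pages.<br>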


<br>-------------<br>Trevor Jenkins wrote:<br><br>Your requirements are rather specific. But as (the English language)<br>Wikipedia uses a consistent markup scheme with those formal elements named<br>(either by explicit id or implicit class names in attributes) you might be<br>


able to strip out just the textual content by running an XSLT stylesheet<br>processor over the downloaded files and deleting the junk you don't want.<br><br>-------------<br>Eros Zanchetta wrote:<br><br>I recommend Antonio Fuschetto's WikiExtractor too: I used it recently<br>


to create a corpus of texts extracted from Wikipedia and it worked like<br>a charm.<br>As Goran Rakic said the site is currently down, but you can download<br>the original script from here (this is a temporary link, don't count on<br>


this to stay online long):<br><a href="http://sslmit.unibo.it/~eros/WikiExtractor.py.gz">http://sslmit.unibo.it/~eros/WikiExtractor.py.gz</a><br>You'll need to download the XML dump from the Wikipedia repository and<br>


run the script on it, something like this:<br>bunzip2 -c enwiki-latest-pages-articles.xml.bz2 | WikiExtractor.py<br><br>-------------<br>Hartmut Oldenbürger wrote:<br><br>Besides looking at single-purpose, possibly costly tools, you should give more<br>weight to durable, free, open-source solutions:<br><br>R is a very high-level scripting language, well suited to text manipulation<br>and processing, mathematical and statistical analysis, and rich graphical<br>output, and it can be driven from several graphical user interfaces.<br>


<br>By now R is a lingua franca, available for almost all computer systems<br>at <a href="http://cran.at.r-project.org/">http://cran.at.r-project.org/</a><br>It has multi-language documentation, a journal, mailing lists, and user<br>conferences for experts and users worldwide.<br><br>For your purpose, among the ~2500 application packages there is<br><a href="http://cran.at.r-project.org/web/packages/tm/vignettes/tm.pdf">http://cran.at.r-project.org/web/packages/tm/vignettes/tm.pdf</a><br>which gives an entry point to text mining and corpus analysis.<br><br>After installing R and 'tm', you will have a basis for your<br>scientific development(s).<br>For me, it has been an amazingly enlightening experience since 1996/7, both for<br>development and for work.<br><br>-------------<br>Cyrus Shaoul wrote:<br><br>I am not sure if this helps you, but I have extracted the text for the<br>English version of Wikipedia (in April of this year)<br>using the WikiExtractor<br>


<<a href="http://medialab.di.unipi.it/wiki/Wikipedia_Extractor">http://medialab.di.unipi.it/wiki/Wikipedia_Extractor</a>> toolset and<br>created a 990 million word corpus that is freely available on my web site:<br>


<br><a href="http://www.psych.ualberta.ca/~westburylab/downloads/westburylab.wikicorp.download.html">http://www.psych.ualberta.ca/~westburylab/downloads/westburylab.wikicorp.download.html</a><br><br>-------------<br>Matthias Richter wrote:<br>


<br>My answer is Perl and the XML dump, but there is a degree of nastiness in the details and it depends on what one expects from the quality of the<br>results.<br><br>There is also Wikiprep from Evgeny Gabrilovich floating around, which didn't exist then and which I haven't looked at yet (but they are using it at Leipzig now for producing WP2010 corpora).<br>


<br>And finally <a href="http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html">http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html</a> may be a worthwhile source for reading and tinkering.<br>
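<br>The XML-dump route mentioned in several of the answers above can be sketched roughly as follows, here in Python rather than Perl. This is not the code of WikiExtractor or of any other tool named in this summary: the function names are hypothetical, and strip_markup() is a deliberately crude regular-expression cleanup that ignores nested templates, tables and many other wikitext constructs, which is where the "nastiness in the details" comes in.<br>

```python
import re
import sys
import xml.etree.ElementTree as ET


def local(tag):
    """Drop the '{namespace}' prefix that the dump puts on every tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs without loading the whole dump."""
    title, text = None, ""
    for _, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            elem.clear()  # release the finished page's children


def strip_markup(wikitext):
    """Very rough wikitext cleanup (an illustration, not a full parser)."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)  # non-nested templates
    text = re.sub(r"<ref[^>]*/>|<ref[^>]*>.*?</ref>", "", text, flags=re.S)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)  # bold and italic quote runs
    # == Heading == -> Heading
    text = re.sub(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", r"\1", text, flags=re.M)
    return text.strip()


def dump_to_text(stream, out):
    """Write the title and cleaned text of every page in the dump to out."""
    for title, wikitext in iter_pages(stream):
        out.write(f"{title}\n{strip_markup(wikitext)}\n\n")


# To use with shell redirection or a pipe, call:
#     dump_to_text(sys.stdin.buffer, sys.stdout)
```

With that call enabled and the file saved under a name of your choice (say dump2text.py, a hypothetical name), it could be fed in the same way as shown above for WikiExtractor:<br>bunzip2 -c enwiki-latest-pages-articles.xml.bz2 | python dump2text.py<br>Note that the dump also contains non-article pages, which this sketch does not filter out.<br>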


<br>-------------<br>Anas Tawileh wrote:<br><br>Check this tool out (WP2TXT: Wikipedia to Text Converter):<br><a href="http://wp2txt.rubyforge.org/">http://wp2txt.rubyforge.org/</a><br><br>-------------<br>Gemma Boleda wrote:<br>


<br>we've developed a Java-based parser to do just this. It is available for download at:<br><a href="http://www.lsi.upc.edu/~nlp/wikicorpus">http://www.lsi.upc.edu/~nlp/wikicorpus</a><br><br>-------------<br>Raphael Rubino wrote:<br>


<br>I have modified this one <a href="http://www.u.arizona.edu/~jjberry/nowiki-xml2txt.py">http://www.u.arizona.edu/~jjberry/nowiki-xml2txt.py</a><br>to output trectext format, which is XML; maybe the original one is good enough for you.<br>


<br>-------------<br>Sven Hartrumpf wrote:<br><br>We did this with the additional requirement that headings and paragraph starts are still marked up. We tested our tool only on the German Wikipedia (dewiki-20100603-pages-articles.xml); sample results can be seen here:<br>


<br><a href="http://ki220.fernuni-hagen.de/wikipedia/de/20100603/">http://ki220.fernuni-hagen.de/wikipedia/de/20100603/</a><br><br>------------<br>Constantin Orasan wrote:<br><br>There is a version of PALinkA which has a plugin to import Wikipedia articles. That version is available directly from the author.<br>


<a href="http://clg.wlv.ac.uk/projects/PALinkA/">http://clg.wlv.ac.uk/projects/PALinkA/</a><br><br>------------<br>Torsten Zesch wrote:<br><br>the Java Wikipedia Library (JWPL) contains a parser for the MediaWiki syntax that allows you (among other things) to access the plain-text of a Wikipedia article: <a href="http://www.ukp.tu-darmstadt.de/software/jwpl/">http://www.ukp.tu-darmstadt.de/software/jwpl/</a><br>


===================<br><br><br>-- <br>If you want to build a ship, don't drum up the men to gather wood, divide the work and give orders. Instead, teach them to yearn for the vast and endless sea. (Antoine de Saint-Exupery)<br>