[Corpora-List] Extracting text from Wikipedia articles - Summary

Irina Temnikova irina.temnikova at gmail.com
Wed Sep 8 23:11:04 UTC 2010


Dear CORPORA Mailing list Members,

I would like to thank everybody who replied to my question very much, and
to post a summary of the responses I received.

Best regards,

Irina Temnikova

PhD Student in Computational Linguistics
Editorial Assistant of the Journal of Natural Language Engineering

Research Group in Computational Linguistics
Research Institute of Information and Language Processing
University of Wolverhampton, UK


=============
Question:

Dear CORPORA mailing list members,

Do any of you know of any tool for extracting text specifically from
Wikipedia articles, besides those for extracting text from HTML pages?

I only need the title and the text, without any of the formal elements
present in every Wikipedia article (such as "From Wikipedia, the free
encyclopedia", "This article is about ...", [edit], the list of
languages, "Main article:", "Categories:") and without "Contents", "See
also", "References", "Notes" and "External links".

Can you give me any suggestions?

=============

Answers:

-------------
Roman Klinger wrote:

Users can add arbitrary HTML code. If you want to interpret that (to get
the plain text), you could use the text-based web browser lynx, which can
dump a page to a text file. That works quite well, but it is an HTML
extraction method, which you excluded.

Another approach, which a colleague pointed me to and told me works -- I
have not tried it myself -- is described here:
http://evanjones.ca/software/wikipedia2text.html
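
Something along these lines should do it (a rough Python sketch; the URL
and the output filename are just examples, and lynx must be installed):

    import subprocess

    # Let lynx render the page and dump it as plain text; -nolist
    # suppresses the numbered link list lynx normally appends.
    url = "http://en.wikipedia.org/wiki/Corpus_linguistics"
    dump = subprocess.run(["lynx", "-dump", "-nolist", url],
                          capture_output=True, text=True, check=True)
    with open("article.txt", "w") as out:
        out.write(dump.stdout)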

-------------
Goran Rakic wrote:

Some time ago I used a Python script by Antonio Fuschetto. The script
works on a Wikipedia database dump (an XML file) from
http://download.wikimedia.org and knows how to process individual
articles, strip all wiki tags and produce plain text output.

Google shows me that the script was available from
http://medialab.di.unipi.it/wiki/Wikipedia_Extractor but this site
currently seems to be down. You can download a slightly modified version
from http://alas.matf.bg.ac.rs/~mr04069/WikiExtractor.py

To run the script against the downloaded database dump, pass the dump as
standard input using shell redirection. Change the process_page() method
to fit your needs.
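
If you prefer to drive this from Python rather than from the shell, the
redirection can be reproduced with subprocess (the dump filename below is
just an example, and the dump must already be uncompressed):

    import subprocess

    # Feed the uncompressed XML dump to the script on standard input,
    # mirroring the shell redirection described above.
    with open("enwiki-latest-pages-articles.xml", "rb") as dump:
        subprocess.run(["python", "WikiExtractor.py"], stdin=dump,
                       check=True)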

-------------
Srinivas Gokavarapu wrote:

This is a tool for extracting information from Wikipedia:
http://wikipedia-miner.sourceforge.net/ -- have a look at it.

-------------
Nitin Madnani wrote:

I recently did this. I downloaded the Freebase Wikipedia extraction
(Google for it) and used BeautifulSoup to extract just the text part. It
was a couple of days' work at most.
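
For anyone taking the same route, a minimal BeautifulSoup sketch of the
HTML-to-text step might look like this (it assumes you already have one
article's HTML in a string; the tags it removes are only illustrative,
since the layout of the Freebase extraction is not reproduced here):

    from bs4 import BeautifulSoup

    def html_to_text(html):
        soup = BeautifulSoup(html, "html.parser")
        # Drop elements that carry no article prose before extracting text.
        for tag in soup(["script", "style", "table"]):
            tag.decompose()
        title = soup.title.get_text(strip=True) if soup.title else ""
        text = soup.get_text(separator="\n", strip=True)
        return title, text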

-------------
Trevor Jenkins wrote:

Your requirements are rather specific. But as the (English-language)
Wikipedia uses a consistent markup scheme, with the formal elements you
named identified either by explicit ids or by implicit class names in
attributes, you might be able to strip out just the textual content by
running an XSLT stylesheet processor over the downloaded files and
deleting the junk you don't want.
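
As a very rough sketch of that idea, the stylesheet below keeps the text
and suppresses a few unwanted elements; it assumes well-formed XHTML
input, and the id/class values are only examples of the kind of elements
to remove, not a complete list (here applied with Python's lxml):

    from lxml import etree

    # Rely on XSLT's built-in templates to copy text nodes; the empty
    # templates below swallow the unwanted elements entirely.
    STYLESHEET = etree.XML(b"""
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text"/>
      <xsl:template match="*[@id='siteSub' or @id='toc' or @id='catlinks']"/>
      <xsl:template match="*[@class='editsection']"/>
    </xsl:stylesheet>
    """)

    transform = etree.XSLT(STYLESHEET)
    doc = etree.parse("article.xhtml")   # example input filename
    print(str(transform(doc)))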

-------------
Eros Zanchetta wrote:

I recommend Antonio Fuschetto's WikiExtractor too: I used it recently to
create a corpus of texts extracted from Wikipedia and it worked like a
charm.
As Goran Rakic said, the site is currently down, but you can download the
original script from here (this is a temporary link, so don't count on it
staying online for long):
http://sslmit.unibo.it/~eros/WikiExtractor.py.gz
You'll need to download the XML dump from the Wikipedia repository and
run the script on it, something like this:

    bunzip2 -c enwiki-latest-pages-articles.xml.bz2 | WikiExtractor.py
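
Once the extraction has run, pulling out just the title and the text is
simple; the sketch below assumes the script wraps each article in
<doc ... title="..."> ... </doc> blocks, and the output filename is only
an example, so check what your copy of the script actually produces:

    import re

    # Collect (title, text) pairs from WikiExtractor-style output,
    # assuming <doc ... title="...">text</doc> blocks.
    DOC = re.compile(r'<doc [^>]*title="([^"]*)"[^>]*>(.*?)</doc>', re.S)

    def read_articles(path):
        with open(path, encoding="utf-8") as f:
            return DOC.findall(f.read())

    for title, text in read_articles("wiki_00"):   # example output file
        print(title)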

-------------
Hartmut Oldenbürger wrote:

Besides looking at individual, possibly costly tools, you should give
stronger consideration to enduring, free open source means:

R is a very high-level scripting language, well suited to text
manipulation and processing, mathematical and statistical analysis, and
rich graphical output, and it can be controlled through several graphical
user interfaces.

By now R is a lingua franca, available for almost all computer systems
at http://cran.at.r-project.org/
It has multi-language documentation, a journal, mailing lists, and user
conferences for experts and users worldwide.

For your purpose, among the ~2500 application packages there is
http://cran.at.r-project.org/web/packages/tm/vignettes/tm.pdf
which gives an entry point to text mining and corpus analysis.

After installing R and 'tm', you will have a basis for your scientific
development(s).
For me, it has been an amazing, enlightening experience for development
and work since 1996/97.

-------------
Cyrus Shaoul wrote:

I am not sure if this helps you, but I have extracted the text for the
English version of Wikipedia (in April of this year)
using the WikiExtractor
<http://medialab.di.unipi.it/wiki/Wikipedia_Extractor> toolset and
created a 990 million word corpus that is freely available on my web site:

http://www.psych.ualberta.ca/~westburylab/downloads/westburylab.wikicorp.download.html

-------------
Matthias Richter wrote:

My answer is Perl and the XML dump, but there is a degree of nastiness
in the details, and it depends on what one expects from the quality of
the results.

There is also Wikiprep from Evgeny Gabrilovich floating around, which
didn't exist back then and which I haven't looked at yet (but they are
using it at Leipzig now to produce the WP2010 corpora).

And finally,
http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html
may be a worthwhile source for reading and tinkering.

-------------
Anas Tawileh wrote:

Check this tool out (WP2TXT: Wikipedia to Text Converter):
http://wp2txt.rubyforge.org/

-------------
Gemma Boleda wrote:

We've developed a Java-based parser to do just this. It is available for
download at:
http://www.lsi.upc.edu/~nlp/wikicorpus

-------------
Raphael Rubino wrote:

I have modified this one, http://www.u.arizona.edu/~jjberry/nowiki-xml2txt.py,
to output the trectext format, which is XML; maybe the original one is
good for you.

-------------
Sven Hartrumpf wrote:

We did this with the additional requirement that headings and paragraph
starts are still marked up. We tested our tool only on the German Wikipedia
(dewiki-20100603-pages-articles.xml); sample results can be seen here:

http://ki220.fernuni-hagen.de/wikipedia/de/20100603/

------------
Constantin Orasan wrote:

There is a version of PALinkA which has a plugin to import Wikipedia
articles. That version is available directly from the author.
http://clg.wlv.ac.uk/projects/PALinkA/

------------
Torsten Zesch wrote:

The Java Wikipedia Library (JWPL) contains a parser for the MediaWiki
syntax that allows you (among other things) to access the plain text of a
Wikipedia article: http://www.ukp.tu-darmstadt.de/software/jwpl/
===================


-- 
If you want to build a ship, don't drum up the men to gather wood, divide
the work and give orders. Instead, teach them to yearn for the vast and
endless sea. (Antoine de Saint-Exupery)