<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html; charset=ISO-8859-15"

 http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

Hi Irina,<br>

<br>

I recommend Antonio Fuschetto's WikiExtractor too: I used it recently

to create a corpus of texts extracted from Wikipedia and it worked like

a charm.<br>

<br>

As Goran Rakic said, the site is currently down, but you can download

the original script here (this is a temporary link; don't count on

it staying online for long):<br>

<br>


<a href="http://sslmit.unibo.it/%7Eeros/WikiExtractor.py.gz">http://sslmit.unibo.it/~eros/WikiExtractor.py.gz</a><br>

<br>

You'll need to download the XML dump from the Wikipedia repository and

run the script on it, something like this:<br>

<br>

bunzip2 -c enwiki-latest-pages-articles.xml.bz2 | WikiExtractor.py -<br>
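<br>
If you want to look at what the dump contains before running the extractor, here is a minimal Python sketch (my own illustration, not part of WikiExtractor) that streams a bz2-compressed MediaWiki XML export and yields each page's title and raw text. The names local_name, iter_pages and make_sample are hypothetical, and make_sample only builds a tiny stand-in for the real dump so that the example runs on its own. Note that it returns the raw wiki markup: stripping templates, links and the other formal elements is the part WikiExtractor does for you.<br>
<pre>
```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    # Drop the "{namespace}" prefix that ElementTree puts in front of tags.
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the finished page to keep memory low


def make_sample():
    # Hypothetical two-page stand-in for the real dump (assumption:
    # pages hold a title element and a revision/text element).
    root = ET.Element("mediawiki")
    samples = [
        ("Bologna", "'''Bologna''' is a city."),
        ("Corpus", "A [[text corpus]]."),
    ]
    for title, text in samples:
        page = ET.SubElement(root, "page")
        ET.SubElement(page, "title").text = title
        revision = ET.SubElement(page, "revision")
        ET.SubElement(revision, "text").text = text
    return bz2.compress(ET.tostring(root))


if __name__ == "__main__":
    # Decompress on the fly, as "bunzip2 -c" does in the pipeline above.
    with bz2.open(io.BytesIO(make_sample()), "rb") as dump:
        for title, text in iter_pages(dump):
            print(title, "|", text)
```
</pre>
For the real dump you would pass bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") to iter_pages instead of the sample.<br>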

<br>

Cheers,<br>

Eros Zanchetta<br>

<br>

<pre class="moz-signature" cols="72">-- 

SITLEC

University of Bologna (Forlì)

<a class="moz-txt-link-freetext" href="http://sslmit.unibo.it/~eros/">http://sslmit.unibo.it/~eros/</a>

</pre>

<br>

On 08/27/2010 07:52 PM, Irina Temnikova wrote:

<blockquote

 cite="mid:AANLkTin0dCKimUZrO7SmJdbqb45KTdAyDFQZKpF0m02s@mail.gmail.com"

 type="cite">Dear CORPORA mailing list members,<br>

  <br>

Do any of you know of any

tool for extracting text specifically from Wikipedia articles, besides

those for extracting text from HTML pages?<br>

  <br>

I only need the title

and the text, without any of the formal elements present in every

Wikipedia article (such as "From Wikipedia, the free encyclopedia",

"This article is about ...", [edit], the list of languages, "Main

article:", "Categories:") and without "Contents", "See also",

"References", "Notes" and "External links".<br>

  <br>

Can you give me any suggestions?<br>

  <br>

Thank you very much in advance,<br>

  <br>

Irina<br>

  <br>

  <pre cols="72">Irina Temnikova

PhD Student in Computational Linguistics

Editorial Assistant for the Journal of Natural Language Engineering

Research Group in Computational Linguistics

Research Institute of Information and Language Processing

University of Wolverhampton, UK</pre>

  <br>

</blockquote>

<br>

</body>

</html>