Beyond scrapy, if you need better HTML parsing in Python, I would suggest Beautiful Soup. It's what I've used on several projects, and it's never let me down yet.<br><br>R<br clear="all"><div><br></div><div>--</div>
Richard Littauer <div>Erasmus Mundus MSc in Computational Linguistics<div>Saarland University<br><div><div><a href="http://www.rlittauer.com" target="_blank">http://www.rlittauer.com</a> | @richlitt</div></div></div></div>
<br>
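<div>For what it's worth, here is a minimal sketch of the kind of parsing I mean. It assumes Beautiful Soup 4 (the <code>bs4</code> package) is installed and that the pages are already downloaded. The sample HTML and the <code>extract_paragraphs</code> helper are made up for illustration, so treat it as a starting point rather than a finished tool.</div>

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; in practice this would be HTML you already downloaded.
SAMPLE_HTML = """
<html><head><title>Example</title><script>var x = 1;</script></head>
<body>
  <div id="nav"><a href="/home">Home</a></div>
  <div id="content"><p>النص الأول.</p><p>النص الثاني.</p></div>
</body></html>
"""


def extract_paragraphs(html):
    """Return the text of every <p> element, ignoring scripts and styles."""
    # "html.parser" is the parser bundled with Python, so no extra dependency is needed.
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose content is code or styling rather than running text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return [p.get_text(strip=True) for p in soup.find_all("p")]


paragraphs = extract_paragraphs(SAMPLE_HTML)
for paragraph in paragraphs:
    print(paragraph)
```

<div>Restricting the search to one container first (for example <code>soup.find(id="content")</code>) is usually enough to skip navigation menus and footers.</div>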
<br><br><div class="gmail_quote">On Thu, Jun 21, 2012 at 3:33 PM, Eleftherios Avramidis <span dir="ltr">&lt;<a href="mailto:eleftherios.avramidis@dfki.de" target="_blank">eleftherios.avramidis@dfki.de</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
<div>Hi Imene,<br>
<br>
if you are familiar with Python, I would suggest the Scrapy
project, as it lets you easily isolate the parts of the page that
you are interested in.<br>
<br>
Btw, I think Wikipedia offers its content for download as a
compressed archive. This way you avoid stressing
their servers.<br>
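<div>A rough sketch of reading such an archive in Python, using only the standard library. It assumes a bzip2-compressed MediaWiki XML export of the kind published at dumps.wikimedia.org; the file name in the usage comment and the <code>iter_pages</code> helper are illustrative assumptions, not something taken from this thread.</div>

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(dump_path):
    """Yield (title, wikitext) pairs from a bzip2-compressed MediaWiki XML dump.

    The file is decompressed and parsed as a stream, so the whole dump
    never has to fit in memory.
    """
    with bz2.open(dump_path, "rb") as stream:
        title, text = None, None
        for _event, elem in ET.iterparse(stream, events=("end",)):
            name = local_name(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "text":
                text = elem.text or ""
            elif name == "page":
                yield title, text
                title, text = None, None
                # Free the memory held by the page we have just finished.
                elem.clear()


# Usage (assumed file name, downloaded beforehand):
# for title, wikitext in iter_pages("arwiki-latest-pages-articles.xml.bz2"):
#     print(title, len(wikitext))
```

<div>Note that the text obtained this way is still wiki markup, so it needs a further cleaning step before it can serve as plain-text corpus material.</div>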
<br>
best<br>
Lefteris<div><div class="h5"><br>
<br>
On 21/06/12 11:25, Imene Bensalem wrote:<br>
</div></div></div>
<blockquote type="cite"><div><div class="h5">Dear all,
<div>I would like to build a corpus of Arabic text, and I would like to ask
about tools you know for downloading text (or HTML pages) from the
source websites.</div>
<div>I tried to use WinHTTrack to download pages from Wikipedia, but
it always shows me an error and does not download anything.</div>
<div>Thank you</div>
<div>Best regards</div>
<div><br>
</div>
<div>Imene Bensalem</div>
<div>Mentouri University, Constantine, Algeria</div>
<br>
<br>
</div></div><div class="im"><pre>_______________________________________________
UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a>
Corpora mailing list
<a href="mailto:Corpora@uib.no" target="_blank">Corpora@uib.no</a>
<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a>
</pre>
</div></blockquote>
<br>
<br>
<br>
<pre cols="72">--
MSc. Inf. Eleftherios Avramidis
DFKI GmbH, Alt-Moabit 91c, 10559 Berlin
Tel. +49-30 238 95-1806
Fax. +49-30 238 95-1810
</pre>
</div>
<br></blockquote></div><br>