Beyond Scrapy, if you need better HTML parsing in Python, I would suggest Beautiful Soup. I've used it on several projects, and it hasn't let me down yet.<br><br>R<br clear="all"><div><br></div><div>--</div>

Richard Littauer <div>Erasmus Mundus MSc in Computational Linguistics<div>Saarland University<br><div><div><a href="http://www.rlittauer.com" target="_blank">http://www.rlittauer.com</a> | @richlitt</div></div></div></div>

<br>
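<div>As a rough illustration of the Beautiful Soup suggestion, here is a minimal sketch that pulls paragraph text out of one part of a page. The sample markup, the <code>id="content"</code> container, and the helper name <code>extract_paragraphs</code> are assumptions made up for this example; a real site will need its own selectors.</div>

```python
from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

# Hypothetical page fragment standing in for a downloaded article.
SAMPLE_HTML = """
<html><body>
  <div id="nav"><a href="/">Home</a></div>
  <div id="content">
    <p>First paragraph of the article.</p>
    <p>Second paragraph, with a <a href="/x">link</a>.</p>
  </div>
</body></html>
"""


def extract_paragraphs(html):
    """Return the text of every <p> inside the element with id="content"."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find(id="content")  # assumed container for the article body
    if content is None:
        return []
    # get_text() drops the markup and keeps only the text of each paragraph.
    return [p.get_text().strip() for p in content.find_all("p")]


paragraphs = extract_paragraphs(SAMPLE_HTML)
```

<div>Navigation links outside the chosen container are ignored, which is usually what you want when collecting running text for a corpus.</div>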

<br><br><div class="gmail_quote">On Thu, Jun 21, 2012 at 3:33 PM, Eleftherios Avramidis <span dir="ltr">&lt;<a href="mailto:eleftherios.avramidis@dfki.de" target="_blank">eleftherios.avramidis@dfki.de</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

  <div bgcolor="#FFFFFF" text="#000000">

    <div>Hi Imene,<br>

      <br>

      If you are familiar with Python, I would suggest the Scrapy
      project, which lets you easily isolate the parts of a page that
      you are interested in.<br>

      <br>

      By the way, I think Wikipedia lets you download its content as a
      compressed archive. That way you avoid putting load on their
      servers.<br>

      <br>
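      <div>To make the archive route concrete, here is a minimal sketch of reading page titles and text from a bz2-compressed MediaWiki XML export using only the Python standard library. The tiny sample document, its namespace URI, and the helper names <code>local_name</code> and <code>iter_pages</code> are assumptions for illustration; real dumps are much larger and the schema version may differ.</div>

```python
import bz2
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a MediaWiki XML export (assumed layout and namespace).
SAMPLE_XML = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.6/">
  <page><title>Example</title>
    <revision><text>Some wiki text.</text></revision>
  </page>
  <page><title>Other</title>
    <revision><text>More wiki text.</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) pairs, parsing incrementally to keep memory low."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the finished page's children


# Compress the sample in memory so the example runs without a download.
compressed = io.BytesIO(bz2.compress(SAMPLE_XML.encode("utf-8")))
with bz2.open(compressed, "rb") as stream:
    pages = list(iter_pages(stream))
```

      <div>For a real dump, the in-memory buffer would be replaced by the path of the downloaded <code>.xml.bz2</code> file. The text that comes out is still wiki markup, so it needs further cleaning before it is usable as corpus text.</div>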

      best<br>

      Lefteris<div><div class="h5"><br>

      <br>

      On 21/06/12 11:25, Imene Bensalem wrote:<br>

    </div></div></div>

    <blockquote type="cite"><div><div class="h5">Dear all, 

      <div>I would like to build a corpus of Arabic text, and I would
        like to ask about tools you know of for downloading text (or
        HTML pages) from the source websites.</div>

      <div>I tried to use WinHTTrack to download pages from Wikipedia,
        but it always shows an error and does not download anything.</div>

      <div>Thank you</div>

      <div>Best regards</div>

      <div><br>

      </div>

      <div>Imene Bensalem</div>

      <div>Mentouri University, Constantine, Algeria</div>

      <br>


      <br>

      </div></div><div class="im"><pre>_______________________________________________

UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a>

Corpora mailing list

<a href="mailto:Corpora@uib.no" target="_blank">Corpora@uib.no</a>

<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a>

</pre>

    </div></blockquote>

    <br>

    <br>

    <br>

    <pre cols="72">-- 

MSc. Inf. Eleftherios Avramidis

DFKI GmbH, Alt-Moabit 91c, 10559 Berlin

Tel. <a href="tel:%2B49-30%20238%2095-1806" value="+4930238951806" target="_blank">+49-30 238 95-1806</a>

Fax. <a href="tel:%2B49-30%20238%2095-1810" value="+4930238951810" target="_blank">+49-30 238 95-1810</a> 

</pre>

  </div>


<br></blockquote></div><br>