Corpora: Help please - downloading text from the Web

Christian Coseru christian.coseru at anu.edu.au
Mon Mar 27 06:17:13 UTC 2000


At 11:34 AM 3/23/00 GMT, you wrote:
>
>Hi.  Can anyone help me with the following:
>
>I'm looking for software - preferably freeware or shareware - to
>use to download text from Web sites, for use in a corpus.
>Geoff Wilkins


By far the best spider (I have tested over a dozen commercial and
shareware programs) is httrack, developed by Xavier Roche and Yann
Philippot at CERN. The software is freeware and is available for Unix,
Linux, Solaris and Windows platforms. I have archived sites up to 250MB
in size and over 40000 files with no difficulty at all. The spider is
highly customizable, has extensive support for JavaScript and can easily
gather dynamic or database-driven (e.g. asp, cfm) web sites.

The software and the documentation can be found at http://httrack.free.fr
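
Once a site has been mirrored, the saved HTML pages still have to be
reduced to plain text before they are of much use in a corpus. The
following is only a minimal sketch of that step, assuming Python and its
standard html.parser module; the names TextExtractor and html_to_text are
my own illustrations and are not part of httrack.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping the bodies of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Entering an element whose content is not running text.
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join chunks with a space so adjacent blocks do not run together,
    # then collapse every run of whitespace to a single space.
    return " ".join(" ".join(parser.parts).split())
```

It could be applied to each saved file by reading the file into a string
and passing it to html_to_text. Real pages will usually need more care
(character encodings, navigation menus, repeated footers), so treat this
as a starting point only.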

Christian Coseru


