Corpora: Help please - downloading text from the Web

Geoff Wilkins geoffw at cobuild.collins.co.uk
Thu Mar 23 11:34:28 UTC 2000


Hi.  Can anyone help me with the following:

I'm looking for software - preferably freeware or shareware - to
download text from Web sites for inclusion in a corpus.

The text will come from large sites with many files, sub-directories
and internal links.  At its most basic, the software would simply download
the HTML files from a site, following internal links from the home page.
I've tried various "bots" that do this, but have had problems with all
of them, so I'd welcome recommendations for software that others have
found reliable (and powerful and versatile) for this purpose.
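
For illustration only, the basic behaviour described above (start at the
home page, download each HTML file, follow links that stay on the same
host) might be sketched in Python roughly as follows.  This is a minimal
sketch, not a recommendation of any particular tool: the names crawl,
fetch_html and LinkParser are hypothetical, the max_pages limit is an
assumption added for safety, and the sketch ignores robots.txt and does
not pause between requests, both of which a real crawl of a large site
should handle.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Return the page body as text, or None if the page is not HTML."""
    with urlopen(url, timeout=30) as response:
        if response.headers.get_content_type() != "text/html":
            return None
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(start_url, fetch=fetch_html, max_pages=100):
    """Breadth-first crawl of same-host pages; returns {url: html}."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue  # skip pages that fail to download
        if html is None:
            continue  # skip non-HTML files
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any "#fragment" part.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling crawl("http://www.example.com/") would return a dictionary
mapping each downloaded URL to its HTML text.  The fetch parameter exists
so that the download step can be replaced, for example to add delays or
to test the link-following logic without network access.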

And if anyone knows of packages that are more specifically aimed at the
task I'm undertaking, that would be even better.

Software that maps out the structure of a site, giving an idea of the
size of its files, would also be useful.
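
As a rough illustration of that kind of map, the sketch below turns a
dictionary of already-downloaded pages (URL mapped to HTML text) into an
indented outline with one line per file and its size.  The name
site_outline and the output layout are hypothetical, and the sizes are an
approximation: they are the length of the text re-encoded as UTF-8, not
the byte count reported by the server.

```python
from urllib.parse import urlparse


def site_outline(pages):
    """Return indented outline lines: directories, files and file sizes."""
    # Approximate each file's size as its UTF-8 encoded length.
    sizes = {
        urlparse(url).path or "/": len(html.encode("utf-8"))
        for url, html in pages.items()
    }
    lines = []
    shown_dirs = set()
    for path in sorted(sizes):
        parts = [part for part in path.split("/") if part]
        # Emit a heading for each parent directory not yet listed.
        for depth in range(len(parts) - 1):
            directory = "/".join(parts[: depth + 1])
            if directory not in shown_dirs:
                shown_dirs.add(directory)
                lines.append("  " * depth + parts[depth] + "/")
        name = parts[-1] if parts else "(home page)"
        indent = "  " * max(len(parts) - 1, 0)
        lines.append(f"{indent}{name}  {sizes[path]} bytes")
    return lines
```

Printing "\n".join(site_outline(pages)) gives a quick view of how a site
is laid out and where the bulk of its text sits.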

Geoff Wilkins


