Corpora: Help please - downloading text from the Web

Mark Lewellen lewellen at erols.com
Mon Mar 27 20:56:28 UTC 2000


Also, the Perl modules LWP, HTML, and URI provide tools for
downloading files from the web, processing them as they are
being downloaded, extracting hyperlinks, and other functions.
I've found them useful for repetitive, site-specific tasks in
which I want to filter out some of the files being downloaded.
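For example, here is a minimal, untested sketch along those lines;
the starting URL, the same-host test, and the .html filter are all
placeholder assumptions to adapt to the site in question:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use LWP::UserAgent;
  use HTML::LinkExtor;
  use URI;

  # Placeholder starting URL -- substitute the site you want.
  my $start = 'http://www.example.com/';
  my $host  = URI->new($start)->host;

  my $ua       = LWP::UserAgent->new;
  my $response = $ua->get($start);
  die 'Fetch failed: ' . $response->status_line . "\n"
      unless $response->is_success;

  # Collect <a href> links; passing $start as a base URL makes
  # relative links come back as absolute URIs.
  my @links;
  my $extor = HTML::LinkExtor->new(
      sub {
          my ($tag, %attr) = @_;
          push @links, "$attr{href}" if $tag eq 'a' && $attr{href};
      },
      $start,
  );
  $extor->parse($response->content);
  $extor->eof;

  # Filter: keep only internal links that look like HTML pages.
  for my $link (@links) {
      my $u = URI->new($link);
      next unless $u->scheme =~ /^https?$/ && $u->host eq $host;
      next unless $u->path =~ m{\.html?$|/$};
      print "$link\n";
  }

A full crawler would also keep a "seen" set and fetch each retained
link in the same way; the sketch only shows the download/extract/filter
cycle for a single page.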

Mark Lewellen


> Subject: Corpora: Help please - downloading text from the Web
>
> Hi.  Can anyone help me with the following:
>
> I'm looking for software - preferably freeware or shareware - to
> use to download text from Web sites, for use in a corpus.
>
> This will be from large sites, with a lot of files, sub-directories
> and internal links.  At its most basic, the software would simply download
> HTML files from the site, following internal links from the home page.
> I've tried various "bots" that do this, but have had problems with all
> of them.  So I'd welcome recommendations for software that others have
> found unproblematic (and powerful/multi-functioned) for this purpose.
>
> And if anyone knows of packages that are more specifically aimed at the
> task I'm undertaking, that would be even better.
>
> Also useful would be software that mapped out the structure of
> sites, giving an idea of the size of the files.
>


>
> I have a related question.  What tools do you use, once you have downloaded
> the HTML files, to (batch-)convert them into reasonably clean "plain" text?
>
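
On that last question, one Perl option is to parse each file with
HTML::TreeBuilder and render the tree with HTML::FormatText. A minimal,
untested sketch; the .txt output naming and the script name in the usage
comment are assumptions:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use HTML::TreeBuilder;
  use HTML::FormatText;

  # Usage: perl html2txt.pl file1.html file2.html ...
  # (the script name is hypothetical)
  for my $file (@ARGV) {
      my $tree = HTML::TreeBuilder->new->parse_file($file)
          or die "Cannot parse $file\n";

      # Render the parse tree as wrapped plain text.
      my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
      my $text      = $formatter->format($tree);
      $tree->delete;    # free the parse tree

      (my $out = $file) =~ s/\.html?$/.txt/i;
      $out .= '.txt' if $out eq $file;    # avoid clobbering the input
      open my $fh, '>', $out or die "Cannot write $out: $!\n";
      print $fh $text;
      close $fh;
  }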


