Corpora: Help please - downloading text from the Web

Sun Mar 26 22:45:33 UTC 2000

On Thu, 23 Mar 2000, Geoff Wilkins wrote:

> I'm looking for software - preferably freeware or shareware - to
> use to download text from Web sites, for use in a corpus.

I have used w3mir
http://www.math.uio.no/~janl/w3mir/
and
SiteSnagger
http://hotfiles.zdnet.com/cgi-bin/texis/swlib/hotfiles/info.html?fcode=000P7Z
Both have shortcomings, but I have downloaded gigabytes of HTML files
with them.

With w3mir (and some home-made programs) I have built a fully automatic
system that downloads all the new articles each day from 10 Norwegian
newspapers on the Web, strips the HTML tags, indexes the text (with IMS
CWB) and makes the full text searchable through a Web browser (with a
password, for copyright reasons). I will present this project at LREC in
Athens later this year.
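
The HTML-stripping step could look roughly like the minimal Python sketch
below. It is not the home-made program mentioned above, only an
illustration using Python's standard html.parser module; the names
TextExtractor and strip_html are made up for this example, and real
newspaper pages would likely also need site-specific handling of menus
and other boilerplate.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping the bodies of <script> and <style>."""

    SKIP = {"script", "style"}  # elements whose contents are not running text

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the text of an HTML string with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join pieces with a space so adjacent paragraphs do not run together
    # (assumption: a stray space inside a word split by inline tags is an
    # acceptable trade-off), then collapse all runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


if __name__ == "__main__":
    page = "<html><body><h1>Nyheter</h1><p>Hei &amp; hallo</p></body></html>"
    print(strip_html(page))
```

The resulting plain text would still have to be tokenized and put into
whatever input format the indexing tool expects.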

Knut Hofland                            |  Knut.Hofland at hit.uib.no
HIT-Centre (former NCCH)                |  http://www.hit.uib.no/knut/
University of Bergen,                   |  Phone: +47 5558 9463
Allegt. 27, N-5007 Bergen, Norway       |  Fax:   +47 5558 9470