Corpora: Help please - downloading text from the Web
Knut Hofland
Knut.Hofland at hit.uib.no
Sun Mar 26 22:45:33 UTC 2000
On Thu, 23 Mar 2000, Geoff Wilkins wrote:
> I'm looking for software - preferably freeware or shareware - to
> use to download text from Web sites, for use in a corpus.
I have used w3mir
http://www.math.uio.no/~janl/w3mir/
and
SiteSnagger
http://hotfiles.zdnet.com/cgi-bin/texis/swlib/hotfiles/info.html?fcode=000P7Z
Both have shortcomings, but I have downloaded gigabytes of HTML files
with them.
With w3mir (and some home-made programs) I have built a fully automatic
system that downloads all the new articles each day from 10 Norwegian
newspapers on the Web, strips the HTML tags, indexes the text (with IMS
CWB) and makes the whole text searchable through a Web browser (with a
password, for copyright reasons). I will present this project at LREC in
Athens later this year.
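
As an illustration of the HTML-stripping step only, here is a minimal
sketch in Python using the standard html.parser module. It is not the
home-made programs mentioned above; the names TextExtractor and
strip_html are hypothetical, and the choice of which tags count as
block-level is an assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> bodies."""

    SKIP = {"script", "style"}
    # Assumed set of tags that should separate words in the output.
    BLOCK = {"p", "br", "div", "li", "tr", "title",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the text of an HTML string with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse all runs of whitespace into single spaces.
    return " ".join("".join(parser.parts).split())
```

Calling strip_html on the contents of each downloaded file gives plain
text that can then be passed on to an indexer. Real newspaper pages
would also need site-specific removal of menus and advertisements,
which this sketch does not attempt.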
Knut Hofland | Knut.Hofland at hit.uib.no
HIT-Centre (former NCCH) | http://www.hit.uib.no/knut/
University of Bergen, | Phone: +47 5558 9463
Allegt. 27, N-5007 Bergen, Norway | Fax: +47 5558 9470