Corpora: Help please - downloading text from the Web

Andrew Harley aharley at cup.cam.ac.uk
Mon Mar 27 08:53:49 UTC 2000


At 11:34 AM 23/03/2000 GMT, Geoff Wilkins wrote:
>
>Hi.  Can anyone help me with the following:
>
>I'm looking for software - preferably freeware or shareware - to
>use to download text from Web sites, for use in a corpus.

For the Cambridge International Corpus, we have used the following two
products to download websites (after obtaining permission from the site
owner - an important point that shouldn't be disregarded):

>   WEBWHACKER - http://www.bluesquirrel.com/whacker
>     The original off-line browser!
>
>   GRAB-A-SITE - http://www.bluesquirrel.com/grabasite
>     An "Industrial Strength" off-line browser!

WebWhacker compresses the data, while Grab-a-Site delivers it as HTML
organised in directory structures - much easier for us to handle, so we now
use Grab-a-Site.
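
To illustrate why a plain directory tree of HTML files is convenient for
corpus work, here is a minimal Python sketch that walks such a tree and
reduces each page to plain text. It is only an illustration, not the process
we actually use: the names TextExtractor, html_to_text and corpus_from_tree
are made up for this example, the files are assumed to be UTF-8, and nothing
here is part of Grab-a-Site itself.

```python
import os
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping the bodies of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join chunks with a space so adjacent paragraphs do not run together,
    # then collapse every run of whitespace into a single space.
    return " ".join(" ".join(parser.parts).split())


def corpus_from_tree(root):
    """Map each .htm/.html file under root (by relative path) to its text."""
    texts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith((".htm", ".html")):
                path = os.path.join(dirpath, name)
                # Assumption: UTF-8; undecodable bytes are replaced.
                with open(path, encoding="utf-8", errors="replace") as fh:
                    rel = os.path.relpath(path, root)
                    texts[rel] = html_to_text(fh.read())
    return texts
```

Calling corpus_from_tree on the top-level download directory gives one
plain-text entry per page, keyed by its path within the site. Joining chunks
with a space is a simplification: a word split by inline markup (for example
a bold initial letter) would come out with a space inside it.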

Andrew Harley
Systems Development Manager
English Language Teaching & Dictionaries
Cambridge University Press

Direct line: (01223)325880
Fax: (01223)325850

Try Cambridge International Dictionaries online (over one and a half
million searches since August 1999) at:
http://www.cup.cam.ac.uk/elt/dictionary

We have recently published the Cambridge Dictionary of American English
(book and CD-ROM combined for only $20.95): see http://www.cup.org/esl/cdae
for more details and to order online.


