[Corpora-List] Automatic web file downloader

William Fletcher fletcher at usna.edu
Thu Sep 9 15:54:41 UTC 2010


Dear Chelo,

The Web Concordancer on my site WebAsCorpus.org offers a quick 
and easy way to compile corpora from webpages.  It is in the 
process of moving to a new cluster of servers.    

Via this temporary link 
http://phrasesinenglish.org/wac/searchwac.html

you can query the search engine Bing for specific words or 
phrases and view concordances of the matching pages.

The pages concordanced during your search can be downloaded in 
a zip file in HTML and / or text format. Be sure to set the 
Download Options appropriately before starting your search.  
If you are only interested in the files, reduce the context 
size and number of matches per document to their minimum 
values.
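
Once a zip file has been downloaded, the text files in it can be 
read into memory for further processing. The following is only a 
minimal Python sketch, not part of the Web Concordancer: the archive 
name wac_download.zip, the ".txt" suffix and the UTF-8 encoding are 
all assumptions and may need adjusting to the files actually 
delivered.

```python
import zipfile
from pathlib import Path


def load_text_files(zip_path, suffix=".txt", encoding="utf-8"):
    """Return {member name: decoded text} for each matching file in the zip."""
    corpus = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Skip HTML copies and anything else that is not plain text.
            if name.lower().endswith(suffix):
                raw = archive.read(name)
                # Undecodable bytes are replaced rather than raising an error.
                corpus[name] = raw.decode(encoding, errors="replace")
    return corpus


if __name__ == "__main__":
    archive_path = Path("wac_download.zip")  # hypothetical file name
    if archive_path.exists():
        docs = load_text_files(archive_path)
        print(f"{len(docs)} text files loaded")
```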

By adding search terms to the exclude / include filters under 
the Advanced Query tab you can extend your corpus far beyond 
Bing's 1000-page-per-query limit.
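
One way to plan such a series of filtered queries is to split a 
search into non-overlapping sub-queries: each sub-query includes one 
filter term and excludes all the terms used before it, and a final 
sub-query excludes them all. The Python sketch below only generates 
such a plan. The function name and the "phrase" / "include" / 
"exclude" keys are hypothetical labels for illustration, not the 
Web Concordancer's actual form fields, and the filters still have 
to be entered under the Advanced Query tab by hand.

```python
def partition_queries(phrase, filter_terms):
    """Split one search into sub-queries whose result sets do not overlap."""
    queries = []
    excluded = []  # terms already used as an include filter
    for term in filter_terms:
        # Pages containing this term but none of the earlier ones.
        queries.append(
            {"phrase": phrase, "include": [term], "exclude": list(excluded)}
        )
        excluded.append(term)
    # Remaining pages: those containing none of the filter terms.
    queries.append({"phrase": phrase, "include": [], "exclude": list(excluded)})
    return queries


if __name__ == "__main__":
    # Hypothetical search phrase and filter terms.
    for query in partition_queries("web corpus", ["linguistics", "software"]):
        print(query)
```

With n filter terms this gives n + 1 sub-queries, each subject to 
the 1000-page limit on its own.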

Up to 20 searches can be carried out simultaneously, depending 
on other traffic on the site. To increase the number of 
simultaneous queries, or if the server is down or slow, use
http://184.154.83.122/wac/
instead.

Good luck -- and don't hesitate to contact me with any 
questions!

Regards,
Bill Fletcher



---- Original message ----
>Date: Thu, 9 Sep 2010 12:04:47 +0200
>From: corpora-bounces at uib.no (on behalf of Chelo Vargas <chelo.vargas at ua.es>)
>Subject: [Corpora-List] Automatic web file downloader  
>To: corpora at uib.no
>
>Dear colleagues,
>I would like to know about software used to build up a corpus of texts by 
>downloading web pages with the help of a search engine. I already know 
>Webgetter (a utility in WST), the one in Sketch Engine, and in TERMINUS 
>(http://melot.upf.edu/Terminus2009/index_es.html)
>
>Thank you very much for your help.
>
>Best wishes,
>
>****************************
>PhD. Ms Chelo Vargas-Sierra
>University of Alicante (Spain)
>Dpto. de Filología Inglesa
>Apdo. 99
>03080 Alicante
>Tlf. 96 590 3438
>
>_______________________________________________
>Corpora mailing list
>Corpora at uib.no
>http://mailman.uib.no/listinfo/corpora
