[Corpora-List] Automatic web file downloader
William Fletcher
fletcher at usna.edu
Thu Sep 9 15:54:41 UTC 2010
Dear Chelo,
The Web Concordancer on my site WebAsCorpus.org offers a quick
and easy way to compile corpora from webpages. It is in the
process of moving to a new cluster of servers.
Via this temporary link
http://phrasesinenglish.org/wac/searchwac.html
you can search for concordances of specific words or phrases
on the search engine Bing.
The pages concordanced during your search can be downloaded in
a zip file in HTML and / or text format. Be sure to set the
Download Options appropriately before starting your search.
If you are only interested in the files, reduce the context
size and number of matches per document to their minimum
values.
By adding search terms to the exclude / include filters under
the Advanced Query tab you can extend your corpus far beyond
Bing's 1000-page-per-query limit.
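As a rough sketch of how the filters can split one search into
several non-overlapping ones, each with its own 1000-page allowance,
consider the following Python snippet. The function name, the
dictionary keys and the example terms are hypothetical and only
illustrate the idea; they are not part of the Web Concordancer,
where the terms are entered by hand under the Advanced Query tab.

```python
def partition_queries(base, filter_terms):
    """Split one query into mutually exclusive sub-queries.

    Hypothetical helper: each sub-query requires one filter term and
    excludes all filter terms used before it, so no page can match
    two sub-queries. A final sub-query excludes every filter term
    and catches the pages that contain none of them.
    """
    queries = []
    excluded = []
    for term in filter_terms:
        queries.append(
            {"query": base, "include": [term], "exclude": list(excluded)}
        )
        excluded.append(term)
    # Remainder: pages matching the base query but none of the filter terms.
    queries.append({"query": base, "include": [], "exclude": list(excluded)})
    return queries


if __name__ == "__main__":
    # Example terms are arbitrary; pick words likely to divide the results.
    for q in partition_queries("corpus linguistics", ["software", "teaching"]):
        print(q)
```

With n filter terms this yields n + 1 separate searches, so the
combined corpus can hold up to (n + 1) x 1000 pages, assuming each
search hits the limit.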
Up to 20 searches can be carried out simultaneously, depending
on other traffic on the site. To increase the number of
simultaneous queries, or if the server is down or slow, use
http://184.154.83.122/wac/
instead.
Good luck -- and don't hesitate to contact me with any
questions!
Regards,
Bill Fletcher
---- Original message ----
>Date: Thu, 9 Sep 2010 12:04:47 +0200
>From: corpora-bounces at uib.no (on behalf of Chelo Vargas <chelo.vargas at ua.es>)
>Subject: [Corpora-List] Automatic web file downloader
>To: corpora at uib.no
>
>Dear colleagues,
>I would like to know about software used to build up a corpus of texts by
>downloading web pages with the help of a search engine. I already know Webgetter
>(a utility in WST), the one in Sketch Engine, and in TERMINUS
>(http://melot.upf.edu/Terminus2009/index_es.html)
>
>Thank you very much for your help.
>
>Best wishes,
>
>****************************
>PhD. Ms Chelo Vargas-Sierra
>University of Alicante (Spain)
>Dpto. de Filología Inglesa
>Apdo. 99
>03080 Alicante
>Tlf. 96 590 3438
>
>_______________________________________________
>Corpora mailing list
>Corpora at uib.no
>http://mailman.uib.no/listinfo/corpora