[Corpora-List] Getting articles from newspapers to compile a corpus
maxwell
maxwell at umiacs.umd.edu
Thu Nov 29 20:02:50 UTC 2012
On 2012-11-29 13:21, Matías Guzmán wrote:
> I was wondering if anyone knows how to get every possible article
> from online newspapers and magazines. I was thinking something like
> giving a program the URL of the newspaper (e.g. www.eltiempo.com)
> and getting the text from all pages therein. Is that possible?
As someone else mentioned, wget (which, last I looked, runs under Windows
as well as under Linux) is one way to do this, assuming the newspaper
keeps its old issues archived on the site.
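For anyone who would rather script the crawl than drive wget, here is a
minimal Python sketch of the same idea: start from one URL, follow links
that stay on the same host, and pause between requests. It uses only the
standard library. The names (LinkParser, fetch, crawl) and the defaults
(100 pages, a 2 second delay) are my own illustrative choices, not
anything wget does, and the sketch does not consult robots.txt.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download one page and return its HTML as text."""
    with urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(start_url, max_pages=100, delay=2.0, fetch=fetch):
    """Breadth-first crawl restricted to the host of start_url.

    Returns a dict mapping each visited URL to its HTML.
    max_pages and delay are assumed defaults; tune them per site.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue  # skip pages that fail to download
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
        if queue:
            time.sleep(delay)  # throttle so the site is not hammered
    return pages
```

Calling crawl("http://www.eltiempo.com/") would return the raw HTML of
up to 100 pages; extracting the article text from that HTML is a
separate step.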
There are, of course, cautions. Some sites will notice that you're
vacuuming everything up and get suspicious, and they may block you,
particularly if you just let wget run unthrottled. You'll also have
cleanup to do, one aspect of which will be to check for duplicate (or
near-duplicate) files. And of course there are potential copyright
issues.
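On the duplicate check, one common approach (not the only one) is to
compare documents by their overlapping word n-grams ("shingles") and
flag pairs whose Jaccard similarity is high. The sketch below assumes
the article text has already been extracted from the HTML; the function
names, the shingle size of 5 and the 0.8 threshold are illustrative
assumptions, not recommended values.

```python
import re


def shingles(text, k=5):
    """Return the set of k-word shingles in text, ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.8, k=5):
    """Compare every pair in docs (a dict of name -> text).

    Returns a list of (name1, name2, similarity) for pairs whose
    similarity is at or above threshold.
    """
    names = sorted(docs)
    sets = {name: shingles(docs[name], k) for name in names}
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            sim = jaccard(sets[first], sets[second])
            if sim >= threshold:
                pairs.append((first, second, sim))
    return pairs
```

This compares every pair, so the cost grows quadratically with the
number of files. For a large crawl you would probably want something
like MinHash to find candidate pairs first.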
Mike Maxwell
University of Maryland