[Corpora-List] Getting articles from newspapers to compile a corpus

maxwell maxwell at umiacs.umd.edu
Thu Nov 29 20:02:50 UTC 2012


On 2012-11-29 13:21, Matías Guzmán wrote:
> I was wondering if anyone knows how to get every possible article
> from online newspapers and magazines. I was thinking something like
> giving a program the URL of the newspaper (e.g. www.eltiempo.com)
> and getting the text from all pages therein. Is that possible?

As someone else mentioned, wget (which, last I looked, runs under
Windows as well as Linux) is one way to do this, assuming the newspaper
has archived its old issues.
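
For concreteness, here is a rough sketch of what a recursive wget run
might look like, wrapped in Python so the pieces are easy to adjust. The
flags are standard wget options, but the function name, the default
values (2-second wait, depth 3, 200k rate limit) and the output
directory are just my assumptions for illustration, not a recipe tuned
for any particular newspaper.

```python
import shlex


def build_wget_command(start_url, wait_seconds=2, depth=3, out_dir="mirror"):
    """Return an argument list for a recursive, throttled wget run.

    All default values here are illustrative assumptions.
    """
    return [
        "wget",
        "--recursive",                    # follow links from the start page
        f"--level={depth}",               # limit how deep the recursion goes
        "--no-parent",                    # never climb above the start URL
        f"--wait={wait_seconds}",         # pause between requests
        "--random-wait",                  # vary the pause around that value
        "--limit-rate=200k",              # cap download bandwidth
        f"--directory-prefix={out_dir}",  # where downloaded files are stored
        start_url,
    ]


if __name__ == "__main__":
    cmd = build_wget_command("http://www.eltiempo.com/")
    # Print the command for inspection; to actually run it, pass the
    # list to subprocess.run(cmd, check=True).
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing the command first, rather than running it straight away, makes
it easy to check the depth and wait settings before pointing it at a
real site.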

There are, of course, cautions.  Some sites will notice that you're
vacuuming everything up and get suspicious, and they may shut you off,
particularly if you just let wget run unthrottled.  You'll also have
cleanup to do, one aspect of which will be to check for duplicate (or
near-duplicate) files.  And of course there are potential copyright
issues.
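
As a minimal sketch of the duplicate check, one common approach is to
compare documents by their overlapping word n-grams ("shingles") and
flag pairs whose Jaccard similarity is high. This is a technique I am
swapping in for illustration, not something the tools above do for you;
the function names, the shingle size of 5, and the 0.8 threshold are all
assumptions you would want to tune on your own data.

```python
import re
from itertools import combinations


def shingles(text, k=5):
    """Return the set of k-word shingles (overlapping word sequences)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than k words: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.8, k=5):
    """Compare every pair in docs (a dict of name -> text).

    Returns a list of (name1, name2, score) for pairs whose similarity
    is at least threshold. Both threshold and k are assumed defaults.
    """
    sets = {name: shingles(text, k) for name, text in docs.items()}
    pairs = []
    for (n1, s1), (n2, s2) in combinations(sets.items(), 2):
        score = jaccard(s1, s2)
        if score >= threshold:
            pairs.append((n1, n2, score))
    return pairs
```

Comparing every pair is quadratic in the number of files, so this is
only practical for modest collections; for a large crawl you would
probably want something like MinHash to avoid the all-pairs comparison.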

    Mike Maxwell
    University of Maryland
