[Corpora-List] Getting articles from newspapers to compile a corpus

Valerio Basile v.basile at rug.nl
Thu Nov 29 20:47:17 UTC 2012


>> I was wondering if anyone knows how to get every possible article
>> from online newspapers and magazines. I was thinking something like
>> giving a program the URL of the newspaper (e.g. www.eltiempo.com [1])

For the Groningen Meaning Bank we downloaded approx. five years of the
American online newspaper Voice of America: http://www.voanews.com/
We used wget for it, but, as Mark pointed out, it's good practice to
put a cap on the rate at which you download data from their server.
One reason we chose VoA is that its text is in the public domain,
that is, everyone is free to redistribute it. This is something you
may want to look for when building a corpus if you plan to
distribute the raw data along with your annotation.
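
To make the rate cap concrete, here is a minimal Python sketch of the
same idea: fetch a list of pages one by one, pausing between requests.
This is not the setup we actually used (that was wget, which has its
own --wait and --limit-rate options). The function names, the
User-Agent string and the two-second default delay are all
assumptions chosen for illustration.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable


def fetch_url(url: str, timeout: float = 30.0) -> bytes:
    """Download one page and return its raw bytes."""
    # The User-Agent value is a made-up placeholder; pick one that identifies you.
    request = urllib.request.Request(
        url, headers={"User-Agent": "corpus-builder-sketch/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def polite_download(
    urls: Iterable[str],
    delay_seconds: float = 2.0,  # assumed value, tune it to the site
    fetch: Callable[[str], bytes] = fetch_url,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, bytes]:
    """Fetch each URL in order, waiting delay_seconds between requests."""
    pages: Dict[str, bytes] = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(delay_seconds)
        pages[url] = fetch(url)
    return pages
```

Calling polite_download() with a list of article URLs returns a
dictionary mapping each URL to the downloaded bytes. The fetch and
sleep parameters are only there so the loop can be tried out without
touching the network.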

What language/variety were you looking for?

