[Corpora-List] Getting articles from newspapers to compile a corpus

William Fletcher fletcher at usna.edu
Sun Dec 2 19:24:07 UTC 2012


There are many free tools out there to scrape websites for specific
content.  This tutorial includes an example that is somewhat comparable:
http://net.tutsplus.com/tutorials/javascript-ajax/web-scraping-with-node-js/

You might also take a look at Bobik:
http://usebobik.com/
Bobik is a cloud-powered service for scraping websites in real time. You
can use any language you want as Bobik's own API is entirely HTTP-based.
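
For the kind of page described in the quoted question, a "Show more"
button often just asks the server for the next chunk of HTML, so it may
be possible to request those chunks directly instead of driving a browser.
Below is a rough Python sketch of that idea using only the standard
library. It rests on assumptions: the endpoint URL and the `page` query
parameter are hypothetical, and the real ones would have to be read from
the browser's network inspector for the site in question. The names
`LinkCollector`, `fetch` and `collect_links` are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML fragment."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the endpoint URL.
                self.links.append(urljoin(self.base_url, href))


def fetch(url):
    """Download a URL and return its body as text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def collect_links(endpoint, fetch_page=fetch, page_param="page", max_pages=50):
    """Request page 1, 2, 3, ... and stop when a page adds no new links.

    `endpoint` and `page_param` are assumptions about how the site serves
    its "show more" chunks; they are not taken from any real site.
    """
    seen = []
    for page in range(1, max_pages + 1):
        html = fetch_page(endpoint + "?" + urlencode({page_param: page}))
        parser = LinkCollector(endpoint)
        parser.feed(html)
        added = 0
        for link in parser.links:
            if link not in seen:
                seen.append(link)
                added += 1
        if added == 0:
            break
    return seen
```

Passing the downloader in as `fetch_page` makes it easy to try the loop
on saved HTML before pointing it at a live site, and `max_pages` keeps a
misbehaving endpoint from being polled forever.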

Regards,
Bill Fletcher

On Sat, Dec 1, 2012 at 2:17 PM, Angus B. Grieve-Smith <grvsmth at panix.com> wrote:

>  On 11/29/2012 10:52 PM, True Friend wrote:
>
> I have a related question: News websites (these days) are using AJAX,
> which hides links and generates them via JavaScript instead. See this
> page <http://www.nation.com.pk/pakistan-news-newspaper-daily-english-online/opinions/editorials> for example.
> Apparently this is the archive page for all Editorials on the newspaper
> website, but only a few are shown, and the user has to click on "Show more
> news" under the given stories to get a few more previous editorials. Would
> an HTML crawler be able to bypass this and get all links hidden on this
> page?
>
>
>     It is possible.  Certainly, anyone with enough programming skill could
> write an HTML crawler that can give an AJAX website the information it's
> looking for.   In practice, it may be so obfuscated that it's not worth the
> time and effort.
>
> --
> Angus B. Grieve-Smith
> grvsmth at panix.com
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>

