[Corpora-List] Getting articles from newspapers to compile a corpus

Angus B. Grieve-Smith grvsmth at panix.com
Sat Dec 1 19:17:03 UTC 2012


On 11/29/2012 10:52 PM, True Friend wrote:
> I have a related question: News websites (these days) are using AJAX, 
> which hides links from the page source and generates them via JavaScript. 
> See this page 
> <http://www.nation.com.pk/pakistan-news-newspaper-daily-english-online/opinions/editorials> 
> for example. Apparently this is the archive page for all Editorials on 
> the newspaper website, but only a few are shown, and the user has to click 
> on "Show more news" under the given stories to get a few more previous 
> editorials. Would an HTML crawler be able to bypass this and get all 
> links hidden on this page?
>

     It is possible.  Anyone with enough programming skill could write a 
crawler that sends an AJAX website the same requests its own JavaScript 
sends, and collects the links that come back.   In practice, the site's 
scripts may be so obfuscated that working out those requests is not 
worth the time and effort.
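
     As a rough illustration of the idea, here is a minimal Python 
sketch using only the standard library.  It assumes that clicking "Show 
more news" makes the browser request a numbered page of HTML from the 
server.  The endpoint URL, the "page" parameter and every function name 
below are hypothetical; the real request would have to be found by 
watching the browser's network traffic, and it may look quite different.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a piece of HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html, base_url):
    """Return absolute URLs for all links found in an HTML fragment."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def collect_links(fetch_page, base_url, max_pages=50):
    """Ask for page 1, 2, ... and stop when a page adds no new links."""
    seen = []
    for page in range(1, max_pages + 1):
        added = 0
        for url in extract_links(fetch_page(page), base_url):
            if url not in seen:
                seen.append(url)
                added += 1
        if added == 0:
            break
    return seen


def fetch_over_http(page):
    """Fetch one batch of stories.

    ASSUMPTION: the endpoint and its 'page' parameter are made up for
    illustration; substitute whatever the site's JavaScript really sends.
    """
    url = "http://news.example.com/ajax/more?page=%d" % page
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

     With a real endpoint filled in, the call would be something like 
collect_links(fetch_over_http, "http://news.example.com/").  Passing the 
fetch function in as an argument keeps the paging logic separate from 
the site-specific request, which is the part that obfuscation makes 
hard to work out.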

-- 
Angus B. Grieve-Smith
grvsmth at panix.com
