[Corpora-List] Getting articles from newspapers to compile a corpus

True Friend true.friend2004 at gmail.com
Fri Nov 30 03:52:21 UTC 2012


I have a related question. Many news websites these days use AJAX, which
hides links while generating them on the fly via JavaScript. See this page,
for example:
<http://www.nation.com.pk/pakistan-news-newspaper-daily-english-online/opinions/editorials>
Apparently this is the archive page for all editorials on the newspaper
website, but only a few are shown, and the user has to click on "Show more news"
under the listed stories to get a few more earlier editorials. Would an
HTML crawler be able to get around this and collect all the links hidden on
this page?
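
To make the question concrete: a crawler that only parses the delivered HTML
sees just the links in the first response. If the "Show more news" button
merely requests further batches of stories from the server, a script could in
principle request those batches itself. Below is a minimal Python sketch of
that idea, not a tested solution for this site. `fetch` is a hypothetical
function that returns the HTML for batch number N; how this particular site
actually serves its batches is an assumption that would have to be checked in
the browser's network inspector.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href="..."> in a piece of HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def collect_archive_links(fetch, base_url, max_pages=50):
    """Request batch after batch until one yields no new links.

    fetch(page_number) -> str is hypothetical: it stands in for whatever
    request the "Show more news" button really triggers.
    """
    seen = []
    for page in range(1, max_pages + 1):
        new = [u for u in extract_links(fetch(page), base_url) if u not in seen]
        if not new:
            break  # an empty or repeated batch is taken as the end of the archive
        seen.extend(new)
    return seen
```

The `max_pages` cap is only there so that the loop stops even if the server
keeps returning something for every batch number.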
Regards


On Fri, Nov 30, 2012 at 8:35 AM, Angus Grieve-Smith <grvsmth at panix.com> wrote:

>  On 11/29/2012 4:28 PM, Linda Bawcom wrote:
>
>  Because so many newspapers get their information from the same news
> services, I found a few articles that I had to discard because of an over
> 80% similarity ratio and of course that skews statistics.
>
>
>     Good point!  Some newspapers will abridge the wire stories more than
> others, so it might be useful to find a way to choose the longest version.
>
> --
> 				-Angus B. Grieve-Smith
> 				grvsmth at panix.com
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>
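
On the duplicate wire stories quoted above, the two suggestions (discard
articles over an 80% similarity ratio, and prefer the longest version) can be
combined. The following is a minimal Python sketch under assumptions: it uses
the standard library's `difflib.SequenceMatcher` on word lists as the
similarity measure, which is not necessarily the ratio Linda used, and the
function names are made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1] over word sequences; 1.0 means identical."""
    return SequenceMatcher(None, a.split(), b.split(), autojunk=False).ratio()


def drop_near_duplicates(articles, threshold=0.8):
    """Keep one article per group of near-duplicates, preferring the longest."""
    kept = []
    # Visit the longest articles first so the fullest version of a story wins.
    for text in sorted(articles, key=len, reverse=True):
        if all(similarity(text, other) <= threshold for other in kept):
            kept.append(text)
    return kept
```

Every article is compared with every article already kept, so this is only
practical for small collections; a large corpus would need a cheaper
first-pass filter.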


-- 
*Muhammad Shakir Aziz* *محمد شاکر عزیز*
*Master in Applied Linguistics
Translator, Course Developer, Linguist for Urdu, Punjabi and English*
Urdu:- http://awaz-e-dost.blogspot.com/
English:- http://linguisticslearner.blogspot.com/
Facebook:- http://www.facebook.com/truefriend2004
Skype:- true_friend2004