<p>If you use NLTK, there is a special module that lets you grab the HTML from a URL, strip out all the tags, and keep only the text.</p>

<p>Is this what you are looking for?</p>
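
<p>To make the idea concrete, here is a rough sketch of the fetch-and-strip step using only the Python standard library. It is not NLTK's own code; the names <code>strip_tags</code> and <code>fetch_text</code> are made up for illustration, and the UTF-8 default is an assumption. It handles a single page only, so you would still need to collect the article URLs of the site yourself.</p>

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space, then collapse all whitespace runs.
        # Note: this can split a word that has a tag in the middle of it.
        return " ".join(" ".join(self._chunks).split())


def strip_tags(html):
    """Hypothetical helper: return the visible text of an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def fetch_text(url, encoding="utf-8"):
    """Hypothetical helper: download one page and return its text.

    The encoding default is an assumption; real pages may declare another.
    """
    with urlopen(url) as response:
        raw = response.read()
    return strip_tags(raw.decode(encoding, errors="replace"))
```

<p>Something like <code>fetch_text("http://www.eltiempo.com")</code> would then give you the text of the front page, menus and all, so expect to do some cleaning afterwards.</p>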

<div class="gmail_quote">On Nov 29, 2012 10:21 AM, "Matías Guzmán" &lt;<a href="mailto:mortem.dei@gmail.com">mortem.dei@gmail.com</a>&gt; wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi all,<br><br>I was wondering if anyone knows how to get every possible article from online newspapers and magazines. I was thinking something like giving a program the URL of the newspaper (e.g. <a href="http://www.eltiempo.com" target="_blank">www.eltiempo.com</a>) and getting the text from all pages therein. Is that possible?<br>


<br>Thanks a lot,<br><br>Matías<br>


<br></blockquote></div>