[Corpora-List] Getting articles from newspapers to compile a corpus

Toddy Mladenov me at toddysm.com
Thu Nov 29 18:57:41 UTC 2012


If you use NLTK, there is a special module that lets you grab the HTML
from a URL, strip out all the tags, and keep only the text.

Is this what you are looking for?
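
To illustrate the fetch-and-strip idea, here is a minimal sketch that uses
only the Python standard library rather than NLTK, since the exact NLTK
module is not named above. The names TextExtractor, html_to_text and
fetch_text are made up for this example, and skipping <script>/<style>
content is my own assumption about what "text only" should mean.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}  # assumption: these never hold article text

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Strip all tags from an HTML string and return the visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def fetch_text(url):
    """Download one page and return its text (hypothetical helper)."""
    with urlopen(url) as response:
        # Fall back to UTF-8 if the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return html_to_text(html)
```

Something like fetch_text("http://www.eltiempo.com") would then return the
text of that single page. Menus, footers and other page furniture are still
included, and collecting every article on a site would additionally need
some way of following links, which this sketch does not attempt.
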
On Nov 29, 2012 10:21 AM, "Matías Guzmán" <mortem.dei at gmail.com> wrote:

> Hi all,
>
> I was wondering if anyone knows how to get every possible article from
> online newspapers and magazines. I was thinking something like giving a
> program the URL of the newspaper (e.g. www.eltiempo.com) and getting the
> text from all pages therein. Is that possible?
>
> Thanks a lot,
>
> Matías
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>