[Corpora-List] Getting articles from newspapers to compile a corpus

Khalid CHOUKRI choukri at elda.org
Thu Nov 29 21:16:13 UTC 2012


Hi Matías

Which languages and domains are you looking for, and what sizes? Are
you looking for monolingual data?
ELRA regularly collects such data (after negotiating the rights), so we
may have something to share with you.
Best regards
Khalid

Matías Guzmán wrote, On 29/11/2012 19:21:
> Hi all,
>
> I was wondering if anyone knows how to get every possible article from
> online newspapers and magazines. I was thinking something like giving a
> program the URL of the newspaper (e.g. www.eltiempo.com) and getting the
> text from all pages therein. Is that possible?
>
> Thanks a lot,
>
> Matías
>
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

-- 
Khalid Choukri
ELRA General secretary & ELDA CEO
email: choukri at elda.org;
Web: www.elra.info www.elda.org
Tel. +33 1 43 13 33 33 - Fax. +33 1 43 13 33 30

****************************************************
** Info on LREC 2012 : www.lrec-conf.org
****************************************************