[Corpora-List] How to download text from the web to build a corpus?

Eleftherios Avramidis eleftherios.avramidis at dfki.de
Thu Jun 21 13:33:06 UTC 2012


Hi Imene,

If you are familiar with Python, I would suggest the Scrapy project, as 
it lets you easily isolate the parts of a page that you are interested in.

Btw, I think Wikipedia offers its content for download as a compressed 
archive. That way you avoid stressing their servers.
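
Such an archive can be read locally without unpacking it first. The sketch 
below assumes the archive is a bz2-compressed MediaWiki XML export in which 
each <page> element holds a <title> and a <text> element; the function name 
iter_pages and the file name used afterwards are illustrative assumptions. 
It relies only on the Python standard library.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(dump_path):
    """Yield (title, wikitext) pairs from a bz2-compressed XML export.

    The file is decompressed and parsed as a stream, so the whole
    archive is never expanded on disk.
    """
    with bz2.open(dump_path, "rb") as stream:
        title, text = None, ""
        for _event, elem in ET.iterparse(stream, events=("end",)):
            name = local_name(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "text":
                text = elem.text or ""
            elif name == "page":
                yield title, text
                title, text = None, ""
                elem.clear()  # release the finished page's children
```

For example, looping over iter_pages("arwiki-latest-pages-articles.xml.bz2") 
would visit every article in an Arabic Wikipedia export saved under that 
name. Note that the yielded text is still wiki markup, so templates and link 
syntax have to be stripped in a further step before it is usable as plain 
corpus text.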

best
Lefteris

On 21/06/12 11:25, Imene Bensalem wrote:
> Dear all,
> I would like to build a corpus of Arabic text, and I would like to ask 
> about tools you know for downloading text (or HTML pages) from the source websites.
> I tried to use WinHTTrack to download pages from Wikipedia, but 
> it always shows me an error and does not download anything.
> Thank you
> Best regards
>
> Imene Bensalem
> Mentouri University, Constantine , Algeria
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora



-- 
MSc. Inf. Eleftherios Avramidis
DFKI GmbH, Alt-Moabit 91c, 10559 Berlin
Tel. +49-30 238 95-1806
Fax. +49-30 238 95-1810

-------------------------------------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern

Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff

Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes

Amtsgericht Kaiserslautern, HRB 2313
-------------------------------------------------------------------------------------------
