[Corpora-List] How to download text from the web to build a corpus ?

Richard Littauer richard.littauer at gmail.com
Mon Jul 9 09:16:38 UTC 2012


Beyond scrapy, if you need better HTML parsing in Python, I would suggest
Beautiful Soup. It's what I've used on several projects, and it's never let
me down yet.
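
For illustration, here is a minimal sketch of the text-extraction step such a parser performs. It is not Beautiful Soup itself: it uses only the standard library's html.parser as a stand-in, and the names TextExtractor, html_to_text and the sample string are hypothetical, made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []        # text fragments in document order
        self._skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Hypothetical sample page, standing in for a downloaded document.
sample = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<h1>عنوان</h1><p>نص <b>عربي</b> قصير.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```

With Beautiful Soup installed, the rough equivalent would be BeautifulSoup(html, "html.parser").get_text(), which also copes better with badly formed markup.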

R

--
Richard Littauer
Erasmus Mundus MSc in Computational Linguistics
Saarland University
http://www.rlittauer.com | @richlitt



On Thu, Jun 21, 2012 at 3:33 PM, Eleftherios Avramidis <
eleftherios.avramidis at dfki.de> wrote:

>  Hi Imene,
>
> if you are familiar with Python, I would suggest the scrapy project, as
> you can easily isolate parts of the page that you are interested in.
>
> Btw, I think Wikipedia offers its content for download as a
> compressed archive. This way you avoid stressing their server.
>
> best
> Lefteris
>
>
> On 21/06/12 11:25, Imene Bensalem wrote:
>
> Dear all,
> I would like to build a corpus of Arabic text, and I would like to ask about
> any tools you know for downloading text (or HTML pages) from the source websites.
> I tried to use WinHTTrack to download pages from Wikipedia, but
> it always shows me an error and does not download anything.
> Thank you
> Best regards
>
>  Imene Bensalem
> Mentouri University, Constantine , Algeria
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>
>
>
> --
> MSc. Inf. Eleftherios Avramidis
> DFKI GmbH, Alt-Moabit 91c, 10559 Berlin
> Tel. +49-30 238 95-1806
>
> Fax. +49-30 238 95-1810
>
> -------------------------------------------------------------------------------------------
> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>
> Geschaeftsfuehrung:
> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
> Dr. Walter Olthoff
>
> Vorsitzender des Aufsichtsrats:
> Prof. Dr. h.c. Hans A. Aukes
>
> Amtsgericht Kaiserslautern, HRB 2313
> -------------------------------------------------------------------------------------------
>
>