[Corpora-List] How to download text from the web to build a corpus ?

Renaud Richardet renaud.richardet at epfl.ch
Mon Jul 9 10:22:37 UTC 2012


If you want to download Wikipedia content only, you can get their
database dumps (http://dumps.wikimedia.org/) directly.
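
In case it helps, here is a minimal sketch of reading such a dump in
Python with the standard library only. It assumes a "pages-articles"
dump in the usual MediaWiki XML export format; the function names and
the file name in the usage note are my own placeholders, not anything
the dump site prescribes. The text you get back is raw wiki markup, so
it still needs cleaning before it is usable as corpus text.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream.

    Assumes each <page> holds one <title> and one <revision>/<text>,
    as in the current-revision article dumps.
    """
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the finished page's subtree to limit memory use


def iter_dump(path):
    """Stream pages from a .xml.bz2 dump at a (hypothetical) local path."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)
```

Usage would look like `for title, text in iter_dump("arwiki-latest-pages-articles.xml.bz2"): ...`,
where the file name is an assumption about which dump you downloaded.
Parsing incrementally avoids loading the whole file, which matters
because the dumps are large.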

-- Renaud

On Mon, Jul 9, 2012 at 11:16 AM, Richard Littauer
<richard.littauer at gmail.com> wrote:
> Beyond scrapy, if you need better HTML parsing in Python, I would suggest
> Beautiful Soup. It's what I've used on several projects, and it's never let
> me down yet.
>
> R
>
> --
> Richard Littauer
> Erasmus Mundus MSc in Computational Linguistics
> Saarland University
> http://www.rlittauer.com | @richlitt
>
>
>
> On Thu, Jun 21, 2012 at 3:33 PM, Eleftherios Avramidis
> <eleftherios.avramidis at dfki.de> wrote:
>>
>> Hi Imene,
>>
>> If you are familiar with Python, I would suggest the scrapy project, as
>> you can easily isolate the parts of the page that you are interested in.
>>
>> Btw, I think Wikipedia lets you download its content as a compressed
>> archive. This way you avoid stressing their servers.
>>
>> best
>> Lefteris
>>
>>
>> On 21/06/12 11:25, Imene Bensalem wrote:
>>
>> Dear all,
>> I would like to build a corpus of Arabic text, and I would like to ask
>> which tools you know for downloading text (or HTML pages) from the
>> source websites.
>> I tried to use WinHTTrack to download pages from Wikipedia, but it always
>> shows me an error and does not download anything.
>> Thank you
>> Best regards
>>
>> Imene Bensalem
>> Mentouri University, Constantine , Algeria
>>
>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>
>>
>>
>>
>> --
>> MSc. Inf. Eleftherios Avramidis
>> DFKI GmbH, Alt-Moabit 91c, 10559 Berlin
>> Tel. +49-30 238 95-1806
>>
>> Fax. +49-30 238 95-1810
>>
>>
>> -------------------------------------------------------------------------------------------
>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>>
>> Geschaeftsfuehrung:
>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>> Dr. Walter Olthoff
>>
>> Vorsitzender des Aufsichtsrats:
>> Prof. Dr. h.c. Hans A. Aukes
>>
>> Amtsgericht Kaiserslautern, HRB 2313
>>
>> -------------------------------------------------------------------------------------------
>>
