[Corpora-List] How to download text from the web to build a corpus ?

Wei Lee Woon wlwoon at gmail.com
Thu Jun 21 13:19:21 UTC 2012


The function "urlopen" from the package urllib in Python does it in
one line of code ;)
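
A minimal sketch of that one-liner, assuming Python 3, where urlopen
lives in urllib.request (in Python 2 it was urllib.urlopen). The helper
name fetch_text, the User-Agent string and the example URL are only
illustrative assumptions, not part of any existing tool. Some sites may
reject requests that carry the default Python User-Agent, so the sketch
sets one explicitly:

```python
from urllib.request import Request, urlopen


def fetch_text(url, encoding="utf-8"):
    """Download one page and return its content as a decoded string."""
    # Hypothetical User-Agent string; replace it with one describing your project.
    request = Request(url, headers={"User-Agent": "corpus-builder-sketch/0.1"})
    with urlopen(request) as response:
        raw = response.read()  # the actual "one line": urlopen(url).read()
    # Assumes the page is UTF-8; undecodable bytes are replaced, not fatal.
    return raw.decode(encoding, errors="replace")


# Example call (needs network access; the URL is only an illustration):
# html = fetch_text("https://ar.wikipedia.org/wiki/Special:Random")
```

This returns the raw HTML; stripping the markup to get plain text is a
separate step.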



On 21 June 2012 14:44, Alexandre Trilla <alex at atrilla.net> wrote:
> Perhaps for more specific purposes you could make use of an advanced
> scraping service like Hubify.com
>
> Alex
>
>
>> Hello Imene,
>>
>> The utility `wget' which is available in most Unix-like OSes might be
>> useful for you.
>>
>> With kind regards,
>> Vladimir
>>
>> 2012/6/21 Imene Bensalem <bens.imene at gmail.com>:
>>> Dear all,
>>> I would like to build a corpus of Arabic text, and I would like to ask
>>> which tools you know for downloading text (or HTML pages) from the source
>>> websites. I tried to use WinHTTrack to download pages from Wikipedia, but
>>> it always shows me an error and does not download anything.
>>> Thank you
>>> Best regards
>>>
>>> Imene Bensalem
>>> Mentouri University, Constantine , Algeria
>>>
>>> _______________________________________________
>>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>> Corpora mailing list
>>> Corpora at uib.no
>>> http://mailman.uib.no/listinfo/corpora
>>>
>>
>
>
> --
> _________________________________________________
>
>  ALEXANDRE TRILLA
>  B.Sc., M.Sc. in Electronics, Telecommunications
>  Engineering and Information Technology
>
>  Email: alex at atrilla.net
>  Homepage: http://atrilla.net
> _________________________________________________
>
>
