[Corpora-List] How to download text from the web to build a corpus ?

Thu Jun 21 13:15:39 UTC 2012

Some tools which may help you:

wget to download pages, or preferably the URL download library of your
programming language (most have one), e.g. Python has urllib2.
justext to remove boilerplate http://code.google.com/p/justext/
Onion for deduplication http://code.google.com/p/onion/
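
To make the download and deduplication steps concrete, here is a minimal
Python 3 sketch. It uses urllib.request (the Python 3 successor of urllib2).
The function names, the User-Agent string and the timeout are my own
assumptions for illustration. The duplicate filter only drops exact repeats
of a paragraph after whitespace normalisation; it is a much simpler stand-in
and not the method Onion uses. Boilerplate removal is left out here, so you
would still run justext on each page between the two steps.

```python
import hashlib
import urllib.request


def fetch(url, timeout=30):
    """Download one page and return its body as text."""
    # The User-Agent value is an arbitrary placeholder; some sites reject
    # requests that do not send one.
    request = urllib.request.Request(url, headers={"User-Agent": "corpus-sketch/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset announced in the HTTP header, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def drop_duplicate_paragraphs(paragraphs):
    """Keep the first occurrence of each paragraph (exact match only)."""
    seen = set()
    unique = []
    for paragraph in paragraphs:
        # Collapse runs of whitespace so trivial layout differences do not matter.
        normalised = " ".join(paragraph.split())
        if not normalised:
            continue  # skip empty paragraphs
        digest = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(normalised)
    return unique
```

You would call fetch() on each URL in your list, split the cleaned text into
paragraphs, and pass all paragraphs from all pages through
drop_duplicate_paragraphs() before writing the corpus to disk.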

Sketch Engine (http://www.sketchengine.co.uk/) has built WebBootCat, which
makes corpus collection easy for any language (and has good
filtering techniques for removing spam pages). WebBootCat lets you
download a domain-specific corpus for any language, extract keywords from the
downloaded corpus, and iteratively collect more text using the new
keywords. Or you could try BootCaT: http://bootcat.sslmit.unibo.it/
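
The iterative loop described above can be sketched roughly as follows. This
is only an illustration under my own assumptions, not the WebBootCat or
BootCaT code: make_queries and top_keywords are hypothetical names, the
keyword score is a plain add-one smoothed frequency ratio against a reference
word list, and sending the queries to a search engine and downloading the
hits is left out.

```python
import random
from collections import Counter


def make_queries(seeds, tuple_size=3, count=10, rng=None):
    """Build search queries by sampling random combinations of seed words."""
    rng = rng or random.Random()
    words = sorted(set(seeds))
    if len(words) < tuple_size:
        raise ValueError("need at least tuple_size distinct seed words")
    # Each query is tuple_size distinct seed words joined by spaces.
    return [" ".join(rng.sample(words, tuple_size)) for _ in range(count)]


def top_keywords(corpus_tokens, reference_tokens, k=10):
    """Rank corpus words by relative frequency against a reference list."""
    corpus = Counter(corpus_tokens)
    reference = Counter(reference_tokens)
    corpus_total = sum(corpus.values()) or 1
    reference_total = sum(reference.values()) or 1

    def score(word):
        # Add-one smoothing keeps words absent from the reference finite.
        corpus_rate = (corpus[word] + 1) / corpus_total
        reference_rate = (reference[word] + 1) / reference_total
        return corpus_rate / reference_rate

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(corpus, key=lambda w: (-score(w), w))[:k]
```

One round would then be: build queries from the seed words, download and
clean the pages they return, run top_keywords on the new text, and use the
result as the seed words for the next round.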

For the kinds of problems you may face while building a corpus for a language,
please refer to "A Corpus Factory for many languages" <http://bit.ly/Mkgv14>.

best regards,
Siva

On Thu, Jun 21, 2012 at 2:55 PM, Imene Bensalem <bens.imene at gmail.com> wrote:

> Dear all,
> I would like to build a corpus of Arabic text, and I would like to ask about
> tools you know for downloading text (or HTML pages) from the source websites.
> I tried to use WinHTTrack to download pages from Wikipedia, but
> it always shows an error and does not download anything.
> Thank you
> Best regards
>
> Imene Bensalem
> Mentouri University, Constantine, Algeria
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>