[Corpora-List] How to download text from the web to build a corpus ?

Craig Pfeifer craig.pfeifer at gmail.com
Thu Jun 21 14:07:15 UTC 2012


You might also consider Common Crawl: http://commoncrawl.org/

Craig
______________
craig.pfeifer at gmail.com


On Thu, Jun 21, 2012 at 9:15 AM, Siva Reddy <siva at sivareddy.in> wrote:
> Some tools which may help you:
>
> - wget to download pages or, preferably, the URL download library of your
>   programming language (most have one, e.g. Python has urllib2).
> - justext to remove boilerplate: http://code.google.com/p/justext/
> - Onion for deduplication: http://code.google.com/p/onion/
>
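A minimal sketch of the download step in Python 3, where urllib.request is
the standard-library successor to urllib2. The names fetch_text and
unique_paragraphs and the User-Agent string are made up for illustration.
The duplicate filter only drops exact repeats of a paragraph, so it is a
stand-in for, not a reimplementation of, what Onion does.

```python
import hashlib
import urllib.request


def fetch_text(url, timeout=30, fallback_encoding="utf-8"):
    """Download one URL and return its body decoded to a str."""
    # The User-Agent value is a hypothetical placeholder, not a required name.
    request = urllib.request.Request(
        url, headers={"User-Agent": "corpus-sketch/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
        # Use the charset declared by the server, if any.
        charset = response.headers.get_content_charset() or fallback_encoding
    return raw.decode(charset, errors="replace")


def unique_paragraphs(text):
    """Return blank-line-separated paragraphs, dropping exact repeats."""
    seen = set()
    kept = []
    for paragraph in text.split("\n\n"):
        # Collapse whitespace so trivially different copies compare equal.
        normalized = " ".join(paragraph.split())
        if not normalized:
            continue
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(normalized)
    return kept
```

Calling unique_paragraphs(fetch_text(url)) on a raw HTML page will still
include markup and navigation text, so a boilerplate remover such as
justext would normally run between the two steps.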
> Sketch Engine (http://www.sketchengine.co.uk/) has built WebBootCat, which
> makes corpus collection easy for any language (and has good
> filtering techniques for removing spam pages). WebBootCat allows you to
> download a domain-specific corpus for any language, extract keywords from
> the downloaded corpus, and iteratively collect more text using the new
> keywords. Or you could try BooTCaT: http://bootcat.sslmit.unibo.it/
>
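The keyword-driven collection loop described above starts by turning a
list of seed words into search queries. A rough sketch of that one step,
assuming queries are random combinations of seeds; the function name
build_queries and its parameters are hypothetical and are not taken from
WebBootCat or BooTCaT.

```python
import math
import random


def build_queries(seeds, tuple_size=3, n_queries=10, rng=None):
    """Combine seed words into distinct random query strings."""
    if rng is None:
        rng = random.Random()
    unique_seeds = sorted(set(seeds))
    if len(unique_seeds) < tuple_size:
        raise ValueError("need at least tuple_size distinct seed words")
    # Never ask for more queries than there are distinct combinations.
    target = min(n_queries, math.comb(len(unique_seeds), tuple_size))
    queries = set()
    while len(queries) < target:
        # Sort each sample so the same words in another order count once.
        queries.add(tuple(sorted(rng.sample(unique_seeds, tuple_size))))
    return [" ".join(query) for query in sorted(queries)]
```

Each query would be sent to a search engine, the returned pages downloaded
and cleaned, and keywords extracted from the result would serve as the
seeds for the next round.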
> For the kinds of problems you may face while building a corpus for a language,
> please refer to the paper "A Corpus Factory for Many Languages".
>
> best regards,
> Siva
>
> On Thu, Jun 21, 2012 at 2:55 PM, Imene Bensalem <bens.imene at gmail.com>
> wrote:
>>
>> Dear all,
>> I would like to build a corpus of Arabic text, and I would like to ask
>> which tools you know of for downloading text (or HTML pages) from the
>> source websites. I tried to use WinHTTrack to download pages from
>> Wikipedia, but it always shows an error and does not download anything.
>> Thank you
>> Best regards
>>
>> Imene Bensalem
>> Mentouri University, Constantine, Algeria
>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>
>
