[Corpora-List] How to download text from the web to build a corpus ?

Motaz SAAD motaz.saad at inria.fr
Fri Jun 22 09:05:37 UTC 2012


Hello, 

You may use Wikipedia dumps and extract the information you need from them. 
You may also be interested in the OSAC Arabic corpora: http://sites.google.com/site/motazsite/Home/osac 
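
As a rough illustration of the dump route, here is a minimal Python sketch that streams (title, wikitext) pairs out of a MediaWiki XML export. It assumes the usual dump layout of <page> elements containing a <title> and a revision <text>; the names iter_pages, local_name and SAMPLE are made up for this example, and the tiny in-memory sample only stands in for a real dump file. The text you get back is raw wiki markup, so it still needs cleaning before it is usable as corpus text.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace from a tag, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream.

    `source` is a file name or a binary file-like object. The file is
    parsed incrementally, so a large dump is never loaded in one piece.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""  # empty pages have no text node
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's subtree to save memory


# Tiny stand-in for a dump (hypothetical content, for illustration only).
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b"<page><title>Page one</title>"
    b"<revision><text>hello</text></revision></page>"
    b"<page><title>Page two</title>"
    b"<revision><text/></revision></page>"
    b"</mediawiki>"
)

for title, text in iter_pages(io.BytesIO(SAMPLE)):
    print(title, len(text))
```

For a real, bz2-compressed dump you would pass the object returned by bz2.open(path, "rb") instead of the in-memory sample.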

good luck, 
Motaz 

----- Original Message -----

> From: "Julien Nioche" <lists.digitalpebble at gmail.com>
> To: "corpora" <CORPORA at uib.no>
> Sent: Thursday, June 21, 2012 5:27:01 PM
> Subject: Re: [Corpora-List] How to download text from the web to
> build a corpus ?

> Or Apache Nutch for crawling then Behemoth for text processing (Tika,
> GATE, UIMA, Language Id, etc...).
> http://commoncrawl.org/ is indeed an excellent resource and there is
> a module for ingesting the WARC files in Behemoth

> On 21 June 2012 15:07, Craig Pfeifer < craig.pfeifer at gmail.com >
> wrote:

> > You might also consider commoncrawl -> http://commoncrawl.org/
> >
> > Craig
> > ______________
> > craig.pfeifer at gmail.com
> >
> > On Thu, Jun 21, 2012 at 9:15 AM, Siva Reddy < siva at sivareddy.in >
> > wrote:
> >
> > > Some tools which may help you:
> > >
> > > wget to download pages, or preferably a URL download library;
> > > most programming languages have their own, e.g. Python has urllib2.
> > > justext to remove boilerplate: http://code.google.com/p/justext/
> > > Onion for deduplication: http://code.google.com/p/onion/
> > >
> > > Sketch Engine ( http://www.sketchengine.co.uk/ ) has built
> > > WebBootCat, which makes corpus collection easy for any language
> > > (and has good filtering techniques for removing spam pages).
> > > WebBootCat allows you to download a domain-specific corpus for any
> > > language, extract keywords from the downloaded corpus, and
> > > iteratively collect more text using your new keywords. Or you
> > > could try BootCaT: http://bootcat.sslmit.unibo.it/
> > >
> > > For the kinds of problems you face while building a corpus for a
> > > language, please refer to "A Corpus Factory for many languages".
> > >
> > > best regards,
> > > Siva
> > >
> > > On Thu, Jun 21, 2012 at 2:55 PM, Imene Bensalem <
> > > bens.imene at gmail.com > wrote:
> > >
> > >> Dear all,
> > >> I would like to build a corpus of Arabic text, and I would like to
> > >> ask about tools you know for downloading text (or HTML pages) from
> > >> the source websites.
> > >> I tried to use WinHTTrack to download pages from Wikipedia, but it
> > >> always shows me an error and does not download anything.
> > >> Thank you
> > >> Best regards
> > >>
> > >> Imene Bensalem
> > >> Mentouri University, Constantine, Algeria
> > >>
> > >> _______________________________________________
> > >> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> > >> Corpora mailing list
> > >> Corpora at uib.no
> > >> http://mailman.uib.no/listinfo/corpora

> --

> Open Source Solutions for Text Engineering

> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
