[Corpora-List] How to download text from the web to build a corpus ?
Motaz SAAD
motaz.saad at inria.fr
Fri Jun 22 09:05:37 UTC 2012
Hello,
You may use Wikipedia dumps and extract the information you need from them.
You may also be interested in the OSAC Arabic corpora: http://sites.google.com/site/motazsite/Home/osac
Good luck,
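As a rough illustration of the dump route, here is a minimal Python sketch that streams (title, text) pairs out of a MediaWiki XML export. The function names (local_name, iter_pages, iter_dump) are made up for this example, and it assumes the usual export layout in which each <page> element holds a <title> and a <revision>/<text>. The text it yields is raw wiki markup, so it would still need cleaning before going into a corpus.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace from a tag: '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream.

    Assumes each <page> contains a <title> and a <revision>/<text>.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children; dumps are large


def iter_dump(path):
    """Stream pages from a bz2-compressed dump file (path is hypothetical)."""
    with bz2.open(path, "rb") as stream:
        for page in iter_pages(stream):
            yield page


if __name__ == "__main__":
    # Tiny in-memory stand-in for a real dump, for demonstration only.
    sample = (
        b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
        b"<page><title>Example</title>"
        b"<revision><text>Some '''wiki''' text.</text></revision></page>"
        b"</mediawiki>"
    )
    for page_title, wikitext in iter_pages(io.BytesIO(sample)):
        print(page_title, len(wikitext))
```

For a real run, iter_dump would be pointed at a downloaded dump file (for Arabic, presumably something like arwiki-latest-pages-articles.xml.bz2; check the dump site for the exact name). Parsing incrementally avoids loading the whole dump into memory.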
Motaz
----- Original Message -----
> From: "Julien Nioche" <lists.digitalpebble at gmail.com>
> To: "corpora" <CORPORA at uib.no>
> Sent: Thursday, June 21, 2012 5:27:01 PM
> Subject: Re: [Corpora-List] How to download text from the web to
> build a corpus ?
> Or Apache Nutch for crawling then Behemoth for text processing (Tika,
> GATE, UIMA, Language Id, etc...).
> http://commoncrawl.org/ is indeed an excellent resource and there is
> a module for ingesting the WARC files in Behemoth
> On 21 June 2012 15:07, Craig Pfeifer < craig.pfeifer at gmail.com >
> wrote:
> > You might also consider commoncrawl -> http://commoncrawl.org/
>
> > Craig
>
> > ______________
>
> > craig.pfeifer at gmail.com
>
> > On Thu, Jun 21, 2012 at 9:15 AM, Siva Reddy < siva at sivareddy.in >
> > wrote:
>
> > > Some tools which may help you:
>
> > >
>
> > > > wget to download pages; alternatively, most programming languages
> > > > have their own URL download libraries, e.g. Python has urllib2.
>
> > > justext to remove boilerplate http://code.google.com/p/justext/
>
> > > Onion for deduplication http://code.google.com/p/onion/
>
> > >
>
> > > > Sketch Engine ( http://www.sketchengine.co.uk/ ) has built WebBootCat,
> > > > which makes corpus collection easy for any language (and has good
> > > > filtering techniques for removing spam pages). WebBootCat allows you to
> > > > download a domain-specific corpus for any language, extract keywords
> > > > from the downloaded corpus, and iteratively collect more corpora from
> > > > your new keywords. Or you could try BooTCaT: http://bootcat.sslmit.unibo.it/
>
> > >
>
> > > For the kind of problems you face while building a corpus for a
> > > language,
>
> > > please refer to A Corpus Factory for many languages.
>
> > >
>
> > > best regards,
>
> > > Siva
>
> > >
>
> > > On Thu, Jun 21, 2012 at 2:55 PM, Imene Bensalem <
> > > bens.imene at gmail.com >
>
> > > wrote:
>
> > >>
>
> > >> Dear all,
>
> > > >> I would like to build a corpus of Arabic text, and I would like to ask
> > > >> about tools you know for downloading text (or HTML pages) from the
> > > >> source websites.
> > > >> I tried to use WinHTTrack to download pages from Wikipedia, but
> > > >> it always shows me an error and did not download anything.
>
> > >> Thank you
>
> > >> Best regards
>
> > >>
>
> > >> Imene Bensalem
>
> > >> Mentouri University, Constantine , Algeria
>
> > >>
>
> > >> _______________________________________________
>
> > >> UNSUBSCRIBE from this page:
> > >> http://mailman.uib.no/options/corpora
>
> > >> Corpora mailing list
>
> > >> Corpora at uib.no
>
> > >> http://mailman.uib.no/listinfo/corpora
>
> > >>
>
> > >
>
> > >
>
> > >
>
> > > _______________________________________________
>
> > > UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>
> > > Corpora mailing list
>
> > > Corpora at uib.no
>
> > > http://mailman.uib.no/listinfo/corpora
>
> > >
>
> > _______________________________________________
>
> > UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>
> > Corpora mailing list
>
> > Corpora at uib.no
>
> > http://mailman.uib.no/listinfo/corpora
>
> --
> Open Source Solutions for Text Engineering
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble