[Corpora-List] How to download text from the web to build a corpus ?

Julien Nioche lists.digitalpebble at gmail.com
Thu Jun 21 15:27:01 UTC 2012


Or Apache Nutch <http://nutch.apache.org> for crawling, then Behemoth
<https://github.com/DigitalPebble/behemoth> for text processing (Tika, GATE,
UIMA, Language Id, etc.).
http://commoncrawl.org/ is indeed an excellent resource, and there is a
module for ingesting the WARC files in Behemoth.
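
To give an idea of what ingesting WARC involves, here is a minimal sketch in plain Python. It is not the Behemoth module and does not use its API; the function name iter_warc_records and the sample record are made up for illustration. It assumes an uncompressed stream in which every record carries a Content-Length header. For .warc.gz files, wrapping the file with gzip.open should be enough in Python 3.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    `stream` is any binary file-like object. The name and behaviour of this
    helper are illustrative assumptions, not part of any library.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload length is given in bytes by Content-Length.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# Made-up in-memory record, only to show the call pattern.
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello\r\n\r\n"
)

for headers, payload in iter_warc_records(io.BytesIO(SAMPLE)):
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

In a real pipeline you would keep only the records whose WARC-Type is "response" and pass their payloads on to the text extraction step.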

On 21 June 2012 15:07, Craig Pfeifer <craig.pfeifer at gmail.com> wrote:

> You might also consider commoncrawl -> http://commoncrawl.org/
>
> Craig
> ______________
> craig.pfeifer at gmail.com
>
>
> On Thu, Jun 21, 2012 at 9:15 AM, Siva Reddy <siva at sivareddy.in> wrote:
> > Some tools which may help you:
> >
> > wget to download pages, or preferably a URL download library for your
> > programming language; most languages have one, e.g. Python has urllib2.
> > justext to remove boilerplate: http://code.google.com/p/justext/
> > Onion for deduplication: http://code.google.com/p/onion/
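
As a rough illustration of the three steps listed above (download, strip markup, deduplicate), here is a sketch using only the Python standard library. It uses urllib.request, the Python 3 successor of urllib2. The helpers fetch, html_to_text and dedupe are hypothetical names, and the User-Agent string is invented. The tag stripping and the exact-match hashing are far cruder than what justext and Onion do; they only show where each step sits.

```python
import hashlib
import re
import urllib.request
from html.parser import HTMLParser


def fetch(url, timeout=30):
    """Download one page; UTF-8 is assumed and undecodable bytes are replaced."""
    request = urllib.request.Request(url, headers={"User-Agent": "corpus-sketch/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Naive markup removal: keep text nodes and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


def dedupe(texts):
    """Drop exact duplicates (case-insensitive), keeping first occurrences."""
    seen = set()
    unique = []
    for text in texts:
        key = hashlib.sha1(text.lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

A typical call would be dedupe(html_to_text(fetch(u)) for u in urls), with urls being your own list of page addresses.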
> >
> > Sketch Engine (http://www.sketchengine.co.uk/) has built WebBootCat, which
> > makes corpus collection easy for any language (and has good
> > filtering techniques for removing spam pages). WebBootCat allows you to
> > download a domain-specific corpus for any language, extract keywords from
> > the downloaded corpus, and iteratively collect more corpora from your new
> > keywords. Or you could try BooTCaT: http://bootcat.sslmit.unibo.it/
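
The bootstrapping loop described above (seed words, queries, download, keyword extraction, repeat) can be sketched roughly as follows. This is not the WebBootCat or BooTCaT code. The function names are hypothetical, the keyword score (a smoothed ratio of per-million frequencies against a reference word list) is just one simple choice assumed here, and the search and download steps are left out because they depend on whichever search engine API you have access to. It also assumes a small seed list, since all combinations are enumerated.

```python
import itertools
import random
from collections import Counter


def seed_tuples(seeds, size=3, count=10, rng=None):
    """Draw up to `count` distinct random combinations of `size` seed words.

    Each combination is joined into one query string for a search engine.
    """
    rng = rng or random.Random()
    combos = list(itertools.combinations(sorted(set(seeds)), size))
    rng.shuffle(combos)
    return [" ".join(combo) for combo in combos[:count]]


def extract_keywords(corpus_tokens, reference_tokens, top_k=10):
    """Rank corpus words by how much more frequent they are than in a reference."""
    corpus = Counter(corpus_tokens)
    reference = Counter(reference_tokens)
    n_corpus = max(sum(corpus.values()), 1)
    n_reference = max(sum(reference.values()), 1)

    def score(word):
        per_million_corpus = corpus[word] * 1e6 / n_corpus
        per_million_reference = reference[word] * 1e6 / n_reference
        # Adding 1 avoids division by zero for words absent from the reference.
        return (per_million_corpus + 1) / (per_million_reference + 1)

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(corpus, key=lambda word: (-score(word), word))[:top_k]
```

One round would then be: build queries with seed_tuples, fetch the pages the search engine returns, tokenize them, and feed the output of extract_keywords back in as the next set of seeds.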
> >
> > For the kinds of problems you face while building a corpus for a language,
> > please refer to the paper "A Corpus Factory for Many Languages".
> >
> > best regards,
> > Siva
> >
> > On Thu, Jun 21, 2012 at 2:55 PM, Imene Bensalem <bens.imene at gmail.com>
> > wrote:
> >>
> >> Dear all,
> >> I would like to build a corpus of Arabic text, and I would like to ask
> >> which tools you know of for downloading text (or HTML pages) from the
> >> source websites.
> >> I tried to use WinHTTrack to download pages from Wikipedia, but
> >> it always shows me an error and does not download anything.
> >> Thank you
> >> Best regards
> >>
> >> Imene Bensalem
> >> Mentouri University, Constantine, Algeria
> >>
> >> _______________________________________________
> >> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> >> Corpora mailing list
> >> Corpora at uib.no
> >> http://mailman.uib.no/listinfo/corpora
> >>
> >
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble