[Corpora-List] How to download text from the web to build a corpus ?

Thu Jul 12 09:38:29 UTC 2012

Dear Imene,

A good program for this task is jusText: http://code.google.com/p/justext/
I haven't tried it with Arabic, but Arabic is among the languages offered in
the live demo: http://nlp.fi.muni.cz/projects/justext/ so it may be
suitable for your purposes.
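
To make the idea of boilerplate removal concrete, here is a rough sketch
using only the Python standard library. It is not jusText's algorithm and
does not use its API: it simply drops script/style content and keeps text
blocks with at least a few words. The names SKIP_TAGS, BLOCK_TAGS,
BlockTextExtractor, extract_blocks and the min_words threshold are all
assumptions made up for this illustration.

```python
from html.parser import HTMLParser

# Tags whose content is never running text.
SKIP_TAGS = {"script", "style", "noscript"}
# Tags treated as block boundaries (a minimal, assumed selection).
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "td", "br"}


class BlockTextExtractor(HTMLParser):
    """Collect visible text from an HTML page, one string per block."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._buffer = []
        self._skip_depth = 0

    def flush(self):
        # Normalise whitespace and store the block if anything is left.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)


def extract_blocks(html, min_words=5):
    """Return text blocks with at least min_words whitespace-separated words.

    Short blocks (menus, "Home | About" bars) are treated as boilerplate.
    This is a crude heuristic, not what jusText actually does.
    """
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return [b for b in parser.blocks if len(b.split()) >= min_words]
```

Calling extract_blocks(page_html) on a downloaded page returns a list of
strings, one per retained block. Counting words by whitespace is only a
first approximation; a dedicated tool such as jusText uses
language-dependent cues and is likely to do much better on real pages.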

Regards,

Marco Brunello
Centre for Translation Studies
University of Leeds

	--------- Original Message --------
	From: corpora at uib.no
	To: corpora at uib.no <corpora at uib.no>
	Subject: Corpora Digest, Vol 61, Issue 9
	Date: 10/07/12 12:02

> Date: Mon, 9 Jul 2012 12:22:37 +0200
> From: Renaud Richardet <renaud.richardet at epfl.ch>
> Subject: Re: [Corpora-List] How to download text from the web to build
> 	a corpus ?
> To: Richard Littauer <richard.littauer at gmail.com>
> Cc: Eleftherios Avramidis <eleftherios.avramidis at dfki.de>,
> 	corpora at uib.no
> 
> If you want to download Wikipedia content only, you can get their
> database dumps (http://dumps.wikimedia.org/) directly.
> 
> -- Renaud
> 
> On Mon, Jul 9, 2012 at 11:16 AM, Richard Littauer
> <richard.littauer at gmail.com> wrote:
> > Beyond scrapy, if you need better HTML parsing in Python, I would suggest
> > Beautiful Soup. It's what I've used on several projects, and it's never let
> > me down yet.
> >
> > R
> >
> > --
> > Richard Littauer
> > Erasmus Mundus MSc in Computational Linguistics
> > Saarland University
> > http://www.rlittauer.com | @richlitt
> >
> >
> >
> > On Thu, Jun 21, 2012 at 3:33 PM, Eleftherios Avramidis
> > <eleftherios.avramidis at dfki.de> wrote:
> >>
> >> Hi Imene,
> >>
> >> if you are familiar with Python, I would suggest the scrapy project, as
> >> you can easily isolate parts of the page that you are interested in.
> >>
> >> Btw, Wikipedia I think offers the possibility to download the content in a
> >> compressed archive. This way you avoid stressing their server.
> >>
> >> best
> >> Lefteris
> >>
> >>
> >> On 21/06/12 11:25, Imene Bensalem wrote:
> >>
> >> Dear all,
> >> I would like to build a corpus of Arabic text, and I would like to ask
> >> you about tools you know for downloading text (or HTML pages) from the
> >> source websites.
> >> I tried to use WinHTTrack to download pages from Wikipedia, but it always
> >> shows me an error and does not download anything.
> >> Thank you
> >> Best regards
> >>
> >> Imene Bensalem
> >> Mentouri University, Constantine , Algeria
> >>
> >> 
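
For the Wikipedia route suggested in the quoted messages (the database
dumps at http://dumps.wikimedia.org/), the following is a minimal sketch of
reading a compressed XML dump page by page with the Python standard
library, so the whole file never has to fit in memory. It assumes the dump
is a bz2-compressed MediaWiki XML export with <page>, <title> and <text>
elements; the helper names (local_name, iter_pages, iter_dump) and the
SAMPLE document are made up for illustration. The text it yields is raw
wiki markup, which still needs cleaning before it is usable as a corpus.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix: '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream.

    source may be a filename or a binary file-like object.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the finished page's subtree


def iter_dump(path):
    """Stream pages from a bz2-compressed dump file at path."""
    with bz2.open(path, "rb") as handle:
        yield from iter_pages(handle)


# Tiny in-memory stand-in for a real dump (assumed structure).
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b"<page><title>A</title><revision><text>alpha</text></revision></page>"
    b"<page><title>B</title><revision><text>beta</text></revision></page>"
    b"</mediawiki>"
)

if __name__ == "__main__":
    for page_title, wikitext in iter_pages(io.BytesIO(SAMPLE)):
        print(page_title, len(wikitext))
```

With a real download one would call iter_dump() on the local file, for
example a file named like arwiki-latest-pages-articles.xml.bz2 (the exact
filename is an assumption; check the dump listing for the current names).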
_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora