<html><head><style type='text/css'>p { margin: 0; }</style></head><body><div style='font-family: times new roman,new york,times,serif; font-size: 12pt; color: #000000'>Hello,<br><br>You may use Wikipedia dumps and extract the information you need from them. <br>You may also be interested in the OSAC Arabic corpora: http://sites.google.com/site/motazsite/Home/osac<br><br>Good luck,<br>Motaz <br><br><hr id="zwchr"><blockquote style="border-left:2px solid rgb(16, 16, 255);margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;"><b>From: </b>"Julien Nioche" &lt;lists.digitalpebble@gmail.com&gt;<br><b>To: </b>"corpora" &lt;CORPORA@uib.no&gt;<br><b>Sent: </b>Thursday, June 21, 2012 5:27:01 PM<br><b>Subject: </b>Re: [Corpora-List] How to download text from the web to build a corpus ?<br><br>Or <a href="http://nutch.apache.org" target="_blank">Apache Nutch </a>for crawling, then <a href="https://github.com/DigitalPebble/behemoth" target="_blank">Behemoth </a>for text processing (Tika, GATE, UIMA, Language Id, etc.). <br><a href="http://commoncrawl.org/" target="_blank">http://commoncrawl.org/</a> is indeed an excellent resource, and there is a module for ingesting the WARC files in Behemoth.<br>
<br><div class="gmail_quote">On 21 June 2012 15:07, Craig Pfeifer <span dir="ltr"><<a href="mailto:craig.pfeifer@gmail.com" target="_blank">craig.pfeifer@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
You might also consider commoncrawl -> <a href="http://commoncrawl.org/" target="_blank">http://commoncrawl.org/</a><br>
<br>
Craig<br>
______________<br>
<a href="mailto:craig.pfeifer@gmail.com" target="_blank">craig.pfeifer@gmail.com</a><br>
<div class="HOEnZb"><div class="h5"><br>
<br>
On Thu, Jun 21, 2012 at 9:15 AM, Siva Reddy <<a href="mailto:siva@sivareddy.in" target="_blank">siva@sivareddy.in</a>> wrote:<br>
> Some tools which may help you:<br>
><br>
&gt; wget to download pages or, preferably, the URL download library that most<br>
&gt; programming languages provide, e.g. Python has urllib2.<br>
> justext to remove boilerplate <a href="http://code.google.com/p/justext/" target="_blank">http://code.google.com/p/justext/</a><br>
> Onion for deduplication <a href="http://code.google.com/p/onion/" target="_blank">http://code.google.com/p/onion/</a><br>
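<br>
As a rough illustration of the download, clean-up and deduplication steps listed above, here is a minimal sketch using only the Python 3 standard library (urllib.request is the Python 3 successor of urllib2). It is not jusText or Onion: the tag stripping and exact-duplicate filter below are much cruder stand-ins, and the helper names (fetch, html_to_text_chunks, deduplicate) and the UTF-8 decoding are assumptions made for this example.<br>
<pre>
```python
import hashlib
import urllib.request
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text chunks, skipping script and style elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Normalise whitespace and keep only non-empty, visible text.
        text = " ".join(data.split())
        if text and not self.skip_depth:
            self.parts.append(text)


def fetch(url, timeout=30):
    """Download one page; decoding as UTF-8 is an assumption."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def html_to_text_chunks(html):
    """Return the text found between tags, in document order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.parts


def deduplicate(chunks):
    """Drop exact repeats, keeping the first occurrence of each chunk."""
    seen = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha1(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique
```
</pre>
A page could then be processed with deduplicate(html_to_text_chunks(fetch(url))), where url is whichever page you want to collect.<br>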
><br>
> Sketch Engine (<a href="http://www.sketchengine.co.uk/" target="_blank">http://www.sketchengine.co.uk/</a>) has built WebBootCat which<br>
> makes corpus collection easy for any language (and has good<br>
> filtering techniques for removing spam pages). WebBootCat allows you to<br>
&gt; download a domain-specific corpus for any language, extract keywords from the<br>
&gt; downloaded corpus, and iteratively collect more text using your new<br>
&gt; keywords. Or you could try BooTCaT <a href="http://bootcat.sslmit.unibo.it/" target="_blank">http://bootcat.sslmit.unibo.it/</a><br>
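<br>
The loop described above (collect, extract keywords, collect again) depends on some way of picking keywords. One simple possibility, sketched below, ranks words by how much more frequent they are in the downloaded corpus than in a general reference corpus. This is an illustration only and not necessarily what WebBootCat or BootCaT do; the function name, the smoothing constant and the token lists are assumptions.<br>
<pre>
```python
from collections import Counter


def keywords(focus_tokens, reference_tokens, top_n=10, smoothing=1.0):
    """Rank focus-corpus words by smoothed relative-frequency ratio."""
    focus = Counter(focus_tokens)
    reference = Counter(reference_tokens)
    focus_total = sum(focus.values()) or 1
    reference_total = sum(reference.values()) or 1

    def score(word):
        # Frequencies per million tokens, smoothed to avoid division by zero.
        focus_rate = focus[word] * 1_000_000 / focus_total
        reference_rate = reference[word] * 1_000_000 / reference_total
        return (focus_rate + smoothing) / (reference_rate + smoothing)

    # Highest score first; ties are broken alphabetically.
    ranked = sorted(focus, key=lambda word: (-score(word), word))
    return ranked[:top_n]
```
</pre>
The returned words could serve as search terms for the next round of collection.<br>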
><br>
> For the kind of problems you face while building a corpus for a language,<br>
> please refer to A Corpus Factory for many languages.<br>
><br>
> best regards,<br>
> Siva<br>
><br>
> On Thu, Jun 21, 2012 at 2:55 PM, Imene Bensalem <<a href="mailto:bens.imene@gmail.com" target="_blank">bens.imene@gmail.com</a>><br>
> wrote:<br>
>><br>
>> Dear all,<br>
&gt;&gt; I would like to build a corpus of Arabic text, and I would like to ask which tools you<br>
&gt;&gt; know for downloading text (or HTML pages) from the source websites.<br>
&gt;&gt; I tried to use WinHTTrack to download pages from Wikipedia, but<br>
&gt;&gt; it always shows me an error and does not download anything.<br>
>> Thank you<br>
>> Best regards<br>
>><br>
>> Imene Bensalem<br>
>> Mentouri University, Constantine , Algeria<br>
>><br>
>> _______________________________________________<br>
>> UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a><br>
>> Corpora mailing list<br>
>> <a href="mailto:Corpora@uib.no" target="_blank">Corpora@uib.no</a><br>
>> <a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>
>><br>
><br>
><br>
><br>
><br>
<br>
</div></div></blockquote></div><br><br clear="all"><br>-- <br><span style="border-collapse:;font-family:'Times New Roman';font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;font-size:medium"><span style="font-family:arial;font-size:small"><b style="color:;font-family:arial,helvetica,sans-serif"><img src="http://digitalpebble.com/img/logo.gif" height="38" width="200"><br style="color:;font-family:arial,helvetica,sans-serif">
</b><span style="color:;font-family:arial,helvetica,sans-serif"><span style="color:">Open Source Solutions for Text Engineering</span><br><br></span></span></span><span style="color:"><a href="http://digitalpebble.blogspot.com/" target="_blank">http://digitalpebble.blogspot.com/</a></span><br style="color:">
<span style="color:"><a href="http://www.digitalpebble.com" target="_blank">http://www.digitalpebble.com</a><br><a href="http://twitter.com/digitalpebble" target="_blank">http://twitter.com/digitalpebble</a></span><br>
<br>
<br></blockquote><br></div></body></html>