Dear Matías,<div><br><div>Another tool you could use to scrape newspaper pages is Scrapy, which is Python-based (<a href="http://scrapy.org/">http://scrapy.org/</a>).<br><br>As for oral corpora of Latin American Spanish, I can recommend the Hamburg Corpus of Argentinean Spanish (HaCASpa): <a href="http://www.corpora.uni-hamburg.de/sfb538/en_h9_hacaspa.html">http://www.corpora.uni-hamburg.de/sfb538/en_h9_hacaspa.html</a><br>
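<br>In case it helps with the HTML clean-up step you mentioned, here is a minimal sketch that uses neither Scrapy nor BeautifulSoup, only the html.parser module from Python's standard library, to pull the visible text out of a saved page. The names TextExtractor and html_to_text and the choice of block-level tags are my own assumptions for illustration; real newspaper pages will probably also need site-specific handling for menus, adverts and comment sections.<br>
<pre>
```python
import re
from html.parser import HTMLParser

# Tags whose contents are never article text (assumed list, extend as needed).
SKIP = {"script", "style"}
# Tags treated as paragraph boundaries (assumed list, extend as needed).
BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect visible text, grouped into blocks at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = [[]]   # each block is a list of text fragments
        self.skip_depth = 0  # non-zero while inside a script or style tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag in BLOCK:
            self.blocks.append([])

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK:
            self.blocks.append([])

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.blocks[-1].append(data)


def html_to_text(markup):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join the fragments of each block and collapse runs of whitespace.
    texts = (re.sub(r"\s+", " ", "".join(block)).strip()
             for block in parser.blocks)
    return "\n".join(text for text in texts if text)
```
</pre>
You would call html_to_text on the contents of each file that wget saved, after decoding the file with the right character encoding.<br>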
<br>Kind regards,<div>Daniel<br><br>2012/11/29 Matías Guzmán &lt;<a href="mailto:mortem.dei@gmail.com">mortem.dei@gmail.com</a>&gt;<br>><br>> Thanks for all your answers :)<br>><br>> I'm interested in Spanish. I already have a corpus of about 20 newspapers from Spain, and now I would like to compile corpora for a couple of countries in America. My project (it's for my MA thesis) tries to predict, from the morphosyntactic and lexical features of a sentence, whether it is a pro-drop construction (so yes, I'll be using stats). I have reason to believe that the rate of pro-drop varies from country to country. I already have oral corpora for Spain and Colombia, but finding them for other countries has proven really difficult. I thought that newspaper corpora could be a nice way of getting documents for many different countries.<br>
><br>> I already tried wget, and it seems to work quite well, but I wasn't able to clean the HTML files it creates using BeautifulSoup for Python. Maybe somebody knows of other software capable of doing this?<br>><br>
> Matías<br>-- <br><div><b>Daniel Stein</b></div><div>Universität Hamburg<br><a href="http://www.corpora.uni-hamburg.de/" target="_blank">Hamburger Zentrum für Sprachkorpora</a><br>Max-Brauer-Allee 60<br>22765 Hamburg<br>
Germany<br><br>Tel.: +49 (40) 42838-6425</div></div></div></div><div><br></div>