Thanks for all your answers :)<div><br></div><div>I'm interested in Spanish. I already have a corpus of about 20 newspapers from Spain, and now I would like to compile corpora for a couple of countries in the Americas. My project (it's for my MA thesis) tries to predict, from the morphosyntactic and lexical features of a sentence, whether it is a pro-drop construction (so yes, I'll be using stats). I have reason to believe that the rate of pro drop varies from country to country. I already have oral corpora for Spain and Colombia, but finding corpora for other countries has proven really difficult. I thought that newspaper corpora could be a nice way of getting documents from many different countries.<div>
<br><div>I already tried wget, and it seems to work quite well, but I wasn't able to clean the HTML files it downloads using BeautifulSoup for Python. Does anybody know of other software capable of doing this?</div></div>
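In case it helps, here is a minimal cleaning sketch with BeautifulSoup; the directory name "crawl" and the choice of which tags to drop are just assumptions about a typical wget download, not anything specific to your setup:

```python
from pathlib import Path

from bs4 import BeautifulSoup  # pip install beautifulsoup4

def clean_html(html):
    """Return the visible text of an HTML document, one line per block."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()  # drop non-content elements entirely
    return soup.get_text(separator="\n", strip=True)

# Convert every page fetched by wget into a plain-text file next to it.
for page in Path("crawl").rglob("*.html"):
    text = clean_html(page.read_text(encoding="utf-8", errors="ignore"))
    page.with_suffix(".txt").write_text(text, encoding="utf-8")
```

This only strips markup; separating article body from navigation menus and ads usually needs site-specific selectors on top of it.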
<div><br></div><div>Matías</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">2012/11/29 Linda Bawcom <span dir="ltr"><<a href="mailto:linda.bawcom@sbcglobal.net" target="_blank">linda.bawcom@sbcglobal.net</a>></span><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div style="font-family:tahoma,new york,times,serif;font-size:12pt"><div>Dear Matias,</div>
<div> </div>
<div>I'm afraid I can't help with your question, but I would like to comment that Mike Maxwell has made a very good point regarding cleaning up the articles. For my doctorate I had a very small corpus of just 73 articles on the same topic, taken from various newspapers over only two days. Because so many newspapers get their information from the same news services, I found a few articles that I had to discard because of an over-80% similarity ratio, and of course that skews statistics. For such a small corpus, it was very easy to find the similarities using a plagiarism tool, <a href="http://plagiarism.bloomfieldmedia.com/z-wordpress/software/wcopyfind/" target="_blank">http://plagiarism.bloomfieldmedia.com/z-wordpress/software/wcopyfind/</a> (if anyone is interested) -but perhaps statistics don't enter into your project.</div>
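For a larger corpus, the same near-duplicate check can be scripted. A minimal sketch in Python using the standard library's difflib; the word-level comparison and the 80% threshold are my assumptions here, not how Wcopyfind works internally:

```python
import difflib

def similarity(a, b):
    """Word-level similarity ratio in [0, 1] between two texts."""
    return difflib.SequenceMatcher(None, a.split(), b.split()).ratio()

def near_duplicates(articles, threshold=0.8):
    """Indices of articles that are too similar to an earlier one."""
    dupes = set()
    for i in range(len(articles)):
        if i in dupes:
            continue  # already flagged; keep only the first copy
        for j in range(i + 1, len(articles)):
            if j not in dupes and similarity(articles[i], articles[j]) >= threshold:
                dupes.add(j)
    return dupes
```

The pairwise loop is quadratic, so for thousands of articles a shingling or hashing approach would be preferable, but for a few hundred documents this is fine.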
<div> </div>
<div>Kindest regards,</div>
<div> </div>
<div>Linda Bawcom</div>
<div>Houston Community College-Central<br></div>
<div style="FONT-FAMILY:tahoma,new york,times,serif;FONT-SIZE:12pt"><br>
<div style="FONT-FAMILY:times new roman,new york,times,serif;FONT-SIZE:12pt"><font face="Tahoma">
<hr size="1">
<b><span style="FONT-WEIGHT:bold">From:</span></b> Matías Guzmán <<a href="mailto:mortem.dei@gmail.com" target="_blank">mortem.dei@gmail.com</a>><br><b><span style="FONT-WEIGHT:bold">To:</span></b> "<a href="mailto:corpora@uib.no" target="_blank">corpora@uib.no</a>" <<a href="mailto:corpora@uib.no" target="_blank">corpora@uib.no</a>><br>
<b><span style="FONT-WEIGHT:bold">Sent:</span></b> Thu, November 29, 2012 12:29:16 PM<br><b><span style="FONT-WEIGHT:bold">Subject:</span></b> [Corpora-List] Getting articles from newspapers to compile a corpus<br></font><div>
<div class="h5"><br>Hi all,<br><br>I was wondering if anyone knows how to get every possible article from online newspapers and magazines. I was thinking of something like giving a program the URL of the newspaper (e.g. <a href="http://www.eltiempo.com/" rel="nofollow" target="_blank">www.eltiempo.com</a>) and getting the text from all of its pages. Is that possible?<br>
<br>Thanks a lot,<br><br>Matías<br></div></div></div></div></div></div></blockquote></div><br></div>