[Corpora-List] Getting articles from newspapers to compile a corpus

Daniel Stein danielstein81 at gmail.com
Fri Nov 30 07:42:13 UTC 2012


Dear Matías,

Another tool you could use to scrape newspaper pages is Scrapy, which is
Python-based (http://scrapy.org/).
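
If you would rather keep the files you already fetched with wget, the
cleaning step can also be sketched with Python's standard library alone
(html.parser), without Scrapy or BeautifulSoup. What follows is only a
minimal sketch: the names ParagraphExtractor and extract_paragraphs, the
choice to keep only <p> text, and the set of skipped tags are my own
assumptions, and real newspaper pages will need site-specific rules.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements (hypothetical helper)."""

    # Assumption: these elements never contain article text.
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []   # finished paragraphs, in document order
        self._buffer = []      # text pieces of the paragraph being read
        self._in_p = False     # True while inside a <p> element
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "p":
            # HTML allows unclosed <p>; a new <p> ends the previous one.
            self._flush()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and not self._skip_depth:
            self._buffer.append(data)

    def _flush(self):
        # Join the pieces and collapse runs of whitespace and newlines.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []


def extract_paragraphs(html):
    """Return the paragraph texts of an HTML string as a list of str."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a final paragraph that was never closed
    return parser.paragraphs
```

For a file saved by wget you would read it with open() using the page's
encoding and pass the resulting string to extract_paragraphs(). Keeping
only <p> text discards menus and scripts on many sites, but it will also
drop headlines, so treat it as a starting point.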

With respect to oral corpora of Latin American Spanish, I can recommend
the Hamburg Corpus of Argentinean Spanish (HaCASpa):
http://www.corpora.uni-hamburg.de/sfb538/en_h9_hacaspa.html

Kind regards
Daniel

2012/11/29 Matías Guzmán <mortem.dei at gmail.com>
>
> Thanks for all your answers :)
>
> I'm interested in Spanish. I already have a corpus of about 20 newspapers
> from Spain, and now I would like to compile corpora for a couple of
> countries in America. My project (it's for my MA thesis) is trying to
> predict from the morphosyntactic and lexical features of a sentence whether
> it is a pro-drop construction (so yes, I'll be using stats). I have
> reason to believe that the rate of pro-drop varies from country to country.
> I already have oral corpora for Spain and Colombia, but finding them for
> other countries has proven really difficult. I thought that newspaper
> corpora could be a nice way of getting documents for many different countries.
>
> I already tried wget, and it seems to work quite well, but I wasn't able to
> clean the HTML files it creates using BeautifulSoup for Python. Maybe
> somebody knows of other software capable of doing this?
>
> Matías
-- 
*Daniel Stein*
Universität Hamburg
Hamburger Zentrum für Sprachkorpora <http://www.corpora.uni-hamburg.de/>
Max-Brauer-Allee 60
22765 Hamburg
Germany

Tel.: +49 (40) 42838-6425
_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

