[Corpora-List] Getting articles from newspapers to compile a corpus

Matías Guzmán mortem.dei at gmail.com
Thu Nov 29 21:54:46 UTC 2012


Thanks for all your answers :)

I'm interested in Spanish. I already have a corpus of about 20 newspapers
from Spain, and now I would like to compile corpora for a couple of
countries in the Americas. My project (it's for my MA thesis) tries to
predict, from the morphosyntactic and lexical features of a sentence,
whether it is a pro-drop construction (so yes, I'll be using stats). I have
reason to believe that the rate of pro-drop varies from country to country.
I already have oral corpora for Spain and Colombia, but finding them for
other countries has proven really difficult. I thought that newspaper
corpora could be a good way of getting documents for many different
countries.

I already tried wget. It seems to work quite well, but I wasn't able to
clean the HTML files it creates using BeautifulSoup for Python. Does
anybody know of other software capable of doing this?
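
For the HTML-cleaning step, one possible approach is sketched below. It is
only a minimal sketch: it uses Python's standard-library html.parser rather
than BeautifulSoup, the names TextExtractor and html_to_text are made up for
illustration, and the lists of tags to skip or to treat as paragraph breaks
are assumptions. Real newspaper pages would most likely need site-specific
rules to drop menus, ads and comment sections.

```python
from html.parser import HTMLParser

# Assumption: text inside these tags is never article text.
SKIP_TAGS = {"script", "style", "noscript"}
# Assumption: these tags mark a line break in the extracted text.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Each file downloaded by wget could be read (with the right encoding) and
passed through html_to_text before being added to the corpus.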

Matías


2012/11/29 Linda Bawcom <linda.bawcom at sbcglobal.net>

> Dear Matias,
>
> I'm afraid I can't help concerning your question, but I would like to
> comment that Mike Maxwell has made a very good point regarding cleaning up
> the articles. I had a very small corpus for my doctorate of just 73
> articles about the same topic, taken from only two days of various
> newspapers. Because so many newspapers get their information from the same
> news services, I found a few articles that I had to discard because of a
> similarity ratio of over 80%, which of course skews statistics. For such a
> small corpus, it was very easy to find the similarities using a plagiarism
> tool, http://plagiarism.bloomfieldmedia.com/z-wordpress/software/wcopyfind/
> (if anyone is interested), but perhaps statistics don't enter into your
> project.
>
> Kindest regards,
>
> Linda Bawcom
> Houston Community College-Central
>
>  ------------------------------
> *From:* Matías Guzmán <mortem.dei at gmail.com>
> *To:* "corpora at uib.no" <corpora at uib.no>
> *Sent:* Thu, November 29, 2012 12:29:16 PM
> *Subject:* [Corpora-List] Getting articles from newspapers to compile a
> corpus
>
> Hi all,
>
> I was wondering if anyone knows how to get every possible article from
> online newspapers and magazines. I was thinking something like giving a
> program the URL of the newspaper (e.g. www.eltiempo.com) and getting the
> text from all pages therein. Is that possible?
>
> Thanks a lot,
>
> Matías
>
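
The similarity check described in the quoted message (discarding articles
that are more than 80% similar) could be approximated along the following
lines. This is a rough sketch, not what WCopyfind does: it swaps in the
word-level ratio from Python's standard-library difflib, and the function
names and the dictionary of articles are hypothetical. Only the 0.8
threshold comes from the message itself.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Word-level similarity ratio between two texts, between 0 and 1."""
    return SequenceMatcher(None, a.split(), b.split(), autojunk=False).ratio()


def near_duplicates(articles, threshold=0.8):
    """Return (name_a, name_b, ratio) for each pair above the threshold.

    `articles` maps an article name to its plain text (hypothetical layout).
    """
    pairs = []
    for (name_a, text_a), (name_b, text_b) in combinations(articles.items(), 2):
        ratio = similarity(text_a, text_b)
        if ratio > threshold:
            pairs.append((name_a, name_b, ratio))
    return pairs
```

Comparing every pair is quadratic in the number of articles, so this is
only practical for small corpora such as the 73-article one mentioned in
the quoted message.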