[Corpora-List] Getting articles from newspapers to compile a corpus

Mark Davies Mark_Davies at byu.edu
Thu Nov 29 23:14:24 UTC 2012


>> I already tried wget, and it seems to work quite well, but I wasn't able to clean the HTML files it creates using BeautifulSoup for Python.
>> Maybe somebody knows of other software capable of doing this?

I like JusText: http://code.google.com/p/justext/ (online demo: http://nlp.fi.muni.cz/projects/justext/)

I recently used it to clean about 2 billion words of web pages -- worked great.
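
If you want to script it rather than use the online demo, here's a minimal sketch of cleaning a downloaded page with jusText, following its README (I'm assuming the Spanish stoplist here, since that's the language in question, and "article.html" is just a placeholder filename; the exact API may differ a bit between versions):

import justext

# Clean one HTML file that wget has already saved to disk.
with open("article.html", "rb") as f:
    html = f.read()

paragraphs = justext.justext(html, justext.get_stoplist("Spanish"))
for paragraph in paragraphs:
    if not paragraph.is_boilerplate:  # keep only main-text paragraphs
        print(paragraph.text)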

BTW, if you're only downloading articles from 3-4 newspapers, you can usually figure out what HTML markup each newspaper uses to indicate the beginning and end of the article text, and extract it with a short site-specific script. But for a more heterogeneous collection of texts, something like JusText is better.
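
If you do go the site-specific route, a few lines of BeautifulSoup per newspaper are usually enough. A sketch -- the "articulo-contenido" class is made up for illustration; you'd inspect each paper's HTML to find the real element that wraps the article body:

from bs4 import BeautifulSoup

# Hypothetical site-specific extraction: replace the div class below
# with whatever marker the newspaper actually uses around its articles.
with open("article.html", "rb") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

body = soup.find("div", class_="articulo-contenido")
if body is not None:
    print(body.get_text(separator="\n", strip=True))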

MD

============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================




From: corpora-bounces at uib.no [corpora-bounces at uib.no] on behalf of Matías Guzmán [mortem.dei at gmail.com]
Sent: Thursday, November 29, 2012 2:54 PM
To: Linda Bawcom
Cc: corpora at uib.no
Subject: Re: [Corpora-List] Getting articles from newspapers to compile a corpus


Thanks for all your answers :) 


I'm interested in Spanish. I already have a corpus of about 20 newspapers from Spain, and now I would like to compile corpora for a couple of countries in the Americas. My project (it's for my MA thesis) tries to predict, from the morphosyntactic and lexical features of a sentence, whether the sentence is a pro-drop construction (so yes, I'll be using stats). I have reason to believe that the rate of pro-drop varies from country to country. I already have oral corpora for Spain and Colombia, but finding corpora for other countries has proven really difficult. I thought that newspaper corpora could be a nice way of getting documents from many different countries.


I already tried wget, and it seems to work quite well, but I wasn't able to clean the HTML files it creates using BeautifulSoup for Python. Maybe somebody knows of other software capable of doing this?


Matías



2012/11/29 Linda Bawcom <linda.bawcom at sbcglobal.net>

Dear Matias,

I'm afraid I can't help with your question, but I would like to comment that Mike Maxwell has made a very good point regarding cleaning up the articles. For my doctorate I had a very small corpus of just 73 articles on the same topic, taken from only two days of various newspapers. Because so many newspapers get their information from the same news services, I found a few articles that I had to discard because of a similarity ratio of over 80%, and of course that skews statistics. For such a small corpus, it was very easy to find the similarities using a plagiarism tool, WCopyfind (http://plagiarism.bloomfieldmedia.com/z-wordpress/software/wcopyfind/), if anyone is interested -- but perhaps statistics don't enter into your project.
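
For a corpus that small, even a few lines of Python can flag the near-duplicates. A rough sketch using difflib's similarity ratio -- the 0.8 threshold mirrors the 80% figure above, and this is only a crude stand-in for a dedicated tool like WCopyfind:

import difflib
import itertools
import pathlib

# Compare every pair of plain-text articles in a "corpus" directory
# (a hypothetical layout) and flag pairs that are more than 80% similar.
articles = {p.name: p.read_text(encoding="utf-8")
            for p in pathlib.Path("corpus").glob("*.txt")}

for (a, text_a), (b, text_b) in itertools.combinations(articles.items(), 2):
    ratio = difflib.SequenceMatcher(None, text_a, text_b).ratio()
    if ratio > 0.8:
        print("%s vs %s: %.0f%% similar -- consider discarding one" % (a, b, ratio * 100))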

Kindest regards,

Linda Bawcom
Houston Community College-Central





From: Matías Guzmán <mortem.dei at gmail.com>
To: "corpora at uib.no" <corpora at uib.no>
Sent: Thu, November 29, 2012 12:29:16 PM
Subject: [Corpora-List] Getting articles from newspapers to compile a corpus


Hi all,

I was wondering if anyone knows how to get every possible article from online newspapers and magazines. I was thinking of something like giving a program the URL of a newspaper (e.g. www.eltiempo.com) and getting the text from all the pages therein. Is that possible?
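
Roughly, I imagine a program like this toy sketch -- a breadth-first crawl restricted to one domain, where the page limit and delay are arbitrary, and a real crawl should also honour robots.txt:

import time
import urllib.parse
import urllib.request
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

start = "http://www.eltiempo.com/"
domain = urllib.parse.urlparse(start).netloc
queue, seen = [start], {start}

while queue and len(seen) < 200:  # arbitrary page limit
    url = queue.pop(0)
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")
    except OSError:
        continue
    # ... save html to disk here, then enqueue same-domain links ...
    parser = LinkParser()
    parser.feed(html)
    for link in parser.links:
        absolute = urllib.parse.urljoin(url, link)
        if urllib.parse.urlparse(absolute).netloc == domain and absolute not in seen:
            seen.add(absolute)
            queue.append(absolute)
    time.sleep(1)  # politeness delay between requests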

Thanks a lot,

Matías
_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


