[Corpora-List] Getting articles from newspapers to compile a corpus

Thu Nov 29 21:28:40 UTC 2012

Dear Matias,

I'm afraid I can't help with your question, but I would like to note that 
Mike Maxwell has made a very good point about cleaning up the articles. For 
my doctorate I had a very small corpus of just 73 articles on the same topic, 
taken from only two days of various newspapers. Because so many newspapers get 
their information from the same news services, I found a few articles that I 
had to discard because their similarity ratio was over 80%, and of course that 
kind of duplication skews the statistics. For such a small corpus, it was very 
easy to find the similarities using a plagiarism tool, WCopyfind 
(http://plagiarism.bloomfieldmedia.com/z-wordpress/software/wcopyfind/), if 
anyone is interested. But perhaps statistics don't enter into your project.
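
For anyone who would rather script the check, the idea of flagging article 
pairs above a similarity threshold can be sketched in Python. This is only a 
rough illustration and not how WCopyfind itself works: it uses the standard 
library's difflib.SequenceMatcher on word sequences, and the function names, 
the dictionary of articles, and the default threshold of 0.8 are all 
assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Word-level similarity ratio between 0.0 and 1.0 (case-insensitive)."""
    words_a = a.lower().split()
    words_b = b.lower().split()
    # autojunk=False stops difflib from ignoring very frequent words
    # in long texts, which would distort the ratio.
    return SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()


def near_duplicates(articles: dict[str, str], threshold: float = 0.8):
    """Return (id_a, id_b, ratio) for every pair whose ratio exceeds threshold.

    `articles` is a hypothetical mapping from an article id to its plain text.
    """
    pairs = []
    for (id_a, text_a), (id_b, text_b) in combinations(articles.items(), 2):
        ratio = similarity(text_a, text_b)
        if ratio > threshold:
            pairs.append((id_a, id_b, ratio))
    return pairs
```

Comparing every pair is quadratic in the number of articles, which is fine for 
a corpus of 73 texts but would probably need a different approach (for 
example, shingling and hashing) for a large crawl.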

Kindest regards,

Linda Bawcom
Houston Community College-Central

________________________________
From: Matías Guzmán <mortem.dei at gmail.com>
To: "corpora at uib.no" <corpora at uib.no>
Sent: Thu, November 29, 2012 12:29:16 PM
Subject: [Corpora-List] Getting articles from newspapers to compile a corpus

Hi all,

I was wondering if anyone knows how to get every possible article from online 
newspapers and magazines. I was thinking of something like giving a program the 
URL of the newspaper (e.g. www.eltiempo.com) and getting the text from all the 
pages on that site. Is that possible?
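
One piece of what is being asked for here, finding the links on a page that 
stay within the same newspaper site, can be sketched with Python's standard 
library alone. This is a minimal sketch under stated assumptions, not a 
complete crawler: the class and function names are made up for the example, 
the HTML is passed in as a string, and a real program would still need to 
download each page, follow the collected links in a loop, respect the site's 
robots.txt, and extract the article text from each page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on one page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.add(urljoin(self.base_url, value))


def same_site_links(html: str, base_url: str) -> list[str]:
    """Return the sorted links in `html` that point to the same host as `base_url`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    return sorted(url for url in collector.links if urlparse(url).netloc == host)
```

Calling same_site_links on each downloaded page and queueing the unseen 
results is the usual shape of a breadth-first crawl of a single site.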

Thanks a lot,

Matías