[Corpora-List] Getting articles from newspapers to compile a corpus
Linda Bawcom
linda.bawcom at sbcglobal.net
Thu Nov 29 21:28:40 UTC 2012
Dear Matias,
I'm afraid I can't help with your question, but I would like to comment
that Mike Maxwell has made a very good point regarding cleaning up the
articles. For my doctorate I had a very small corpus: just 73 articles on
the same topic, taken from various newspapers over only two days. Because so
many newspapers get their information from the same news services, I had to
discard a few articles that showed a similarity ratio of over 80%, since such
near-duplicates skew the statistics. For such a small corpus, it was very easy
to find the similarities using a plagiarism tool
http://plagiarism.bloomfieldmedia.com/z-wordpress/software/wcopyfind/ (if
anyone is interested), although perhaps statistics don't enter into your project.
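
For anyone who would rather script this kind of check, here is a minimal
sketch in Python. It is not how WCopyfind works internally; it swaps in the
standard library's difflib.SequenceMatcher over word tokens as a rough
stand-in for a similarity ratio. The function names, the sample texts and the
0.8 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(text_a, text_b):
    """Return a 0.0-1.0 similarity ratio computed over lowercased word tokens."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    # autojunk=False turns off difflib's "popular element" heuristic, which
    # would otherwise ignore very frequent words in longer texts.
    return SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()


def near_duplicates(articles, threshold=0.8):
    """Return (name_a, name_b, ratio) for every pair whose ratio exceeds threshold.

    `articles` is a hypothetical mapping of article name -> plain text.
    """
    flagged = []
    for (name_a, text_a), (name_b, text_b) in combinations(articles.items(), 2):
        ratio = similarity(text_a, text_b)
        if ratio > threshold:
            flagged.append((name_a, name_b, ratio))
    return flagged


# Invented sample texts, only to show the call.
sample = {
    "paper_a": "the mayor opened the new bridge on monday morning",
    "paper_b": "the mayor opened the new bridge on monday afternoon",
    "paper_c": "stock markets fell sharply after the announcement",
}
pairs = near_duplicates(sample)
```

Comparing every pair is fine for a corpus of 73 articles (about 2,600
comparisons), but it grows quadratically, so a much larger corpus would
probably need a different approach.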
Kindest regards,
Linda Bawcom
Houston Community College-Central
________________________________
From: Matías Guzmán <mortem.dei at gmail.com>
To: "corpora at uib.no" <corpora at uib.no>
Sent: Thu, November 29, 2012 12:29:16 PM
Subject: [Corpora-List] Getting articles from newspapers to compile a corpus
Hi all,
I was wondering if anyone knows how to get every possible article from online
newspapers and magazines. I was thinking something like giving a program the URL
of the newspaper (e.g. www.eltiempo.com) and getting the text from all pages
therein. Is that possible?
Thanks a lot,
Matías