[Corpora-List] Web Content Extractor / Screen Scraper

Mon Jun 18 21:54:30 UTC 2007

Resty,

take a look at the CORPORA archive for web-as-corpus tools:

http://listserv.linguistlist.org/cgi-bin/wa?A2=ind0705&L=CORPORA&P=R1226&I=-3

"... You can use a web-as-corpus collection tool such as WWW-Bootcat,
a web-interface to Baroni's perl BootCat:
http://corpora.fi.muni.cz/bootcat/

or WeBoCa, a Java alternative by Leeds student Michael Drayson, an
extension of Andy Roberts' JBootCat: http://code.google.com/p/weboca/
..."
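
For a sense of what "rendered, not raw" text extraction involves, here is a
minimal sketch in Python using only the standard library. It is not code from
BootCat or WeBoCa; the names VisibleTextExtractor and html_to_text are made up
for illustration, and it assumes that dropping script, style, noscript and
title content is a good enough approximation of what a browser displays.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping elements a browser does not display."""

    # Assumption: these tags never contribute visible body text.
    SKIP = {"script", "style", "noscript", "title"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one fragment per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Each text fragment ends up on its own line, so inline markup such as <b>
splits a sentence; a real corpus tool would also need to fetch pages, handle
encodings, and strip navigation boilerplate.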

Eric Atwell, Leeds University

On Tue, 19 Jun 2007, Resty Cena wrote:

> Hello,
> I am looking for a free or open-source Windows utility/application that
> extracts the rendered, text-only (not raw) contents of web pages, such as
> one would use for automatically scraping news feeds. Does anyone use such
> an application?
>
> Basically, the application will be used to harvest texts from the internet
> to build a corpus.
>
> All the best,
> Resty
>