[Corpora-List] Web Content Extractor / Screen Scraper
Eric Atwell
eric at comp.leeds.ac.uk
Mon Jun 18 21:54:30 UTC 2007
Resty,
take a look at the CORPORA archive for web-as-corpus tools:
http://listserv.linguistlist.org/cgi-bin/wa?A2=ind0705&L=CORPORA&P=R1226&I=-3
"... You can use a web-as-corpus collection tool such as WWW-Bootcat,
a web interface to Baroni's Perl BootCat:
http://corpora.fi.muni.cz/bootcat/
or WeBoCa, a Java alternative by Leeds student Michael Drayson, an
extension of Andy Roberts' JBootCat: http://code.google.com/p/weboca/
..."
Eric Atwell, Leeds University
On Tue, 19 Jun 2007, Resty Cena wrote:
> Hello,
> I am looking for a free or open-source Windows utility/application that
> extracts the rendered, text-only (not raw) contents of web pages, such as one
> would use for automatically scraping news feeds. Does anyone use such an
> application?
>
> Basically, the application will be used to harvest texts from the internet to
> build a corpus.
>
> All the best,
> Resty
>
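
For readers who only need the "rendered text, not raw HTML" step described in
the request above, here is a minimal sketch using only the Python standard
library. It is not how BootCat or WeBoCa work; the names VisibleTextExtractor,
html_to_text and fetch_text are hypothetical, and the lists of skipped and
block-level tags are assumptions that only approximate what a browser would
display (no JavaScript is run and no CSS is applied).

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class VisibleTextExtractor(HTMLParser):
    """Collect text outside script/style/title elements (hypothetical helper)."""

    # Assumption: content of these elements is never shown in the page body.
    SKIPPED = {"script", "style", "title"}
    # Assumption: these elements start a new line when rendered.
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; etc. for us
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    """Return the approximate visible text of an HTML string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def fetch_text(url):
    """Download a page and return its approximate visible text."""
    with urlopen(url) as resp:
        # Assumption: fall back to UTF-8 when the server declares no charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, errors="replace")
    return html_to_text(html)
```

Calling fetch_text on each URL in a list and writing the results to files
would give a crude corpus. The tools mentioned above add URL discovery on top
of this kind of step.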