[Corpora-List] RE tool for extracting text from web forum and websites
Paul Johnston
paul.johnston at manchester.ac.uk
Wed Oct 14 17:11:58 UTC 2009
wget is the most basic tool I know of, but when you say a Web forum, do
you have a particular one in mind?
GNU Wget WGET(1)
NAME
Wget - The non-interactive network downloader.
SYNOPSIS
wget [option]... [URL]...
DESCRIPTION
GNU Wget is a free utility for non-interactive download of
files from the Web. It supports HTTP, HTTPS, and FTP
protocols, as well as retrieval through HTTP proxies.
Wget is non-interactive, meaning that it can work in the
background, while the user is not logged on. This allows
you to start a retrieval and disconnect from the system,
letting Wget finish the work. By contrast, most of the Web
browsers require constant user's presence, which can be a
great hindrance when transferring a lot of data.
Wget can follow links in HTML and XHTML pages and create
local versions of remote web sites, fully recreating the
directory structure of the original site. This is sometimes
referred to as "recursive downloading." While doing that,
Wget respects the Robot Exclusion Standard (/robots.txt).
Wget can be instructed to convert the links in downloaded
HTML files to the local files for offline viewing.
Wget has been designed for robustness over slow or unstable
network connections; if a download fails due to a network
problem, it will keep retrying until the whole file has been
retrieved. If the server supports regetting, it will
instruct the server to continue the download from where it
left off.
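
As a rough illustration of how the recursive download described above could feed a corpus, here is a minimal Python sketch. It assumes wget is installed and on the PATH; the function names, the "mirror" directory, the depth of 2 and the one-second wait are illustrative choices, not anything wget or the man page prescribes. It builds a wget command from standard long options, then strips the markup from each saved .html file using only the standard library.

```python
import subprocess
from html.parser import HTMLParser
from pathlib import Path


def build_wget_command(url, dest="mirror", depth=2, wait_seconds=1):
    """Return a wget argument list for a shallow, polite recursive download.

    dest, depth and wait_seconds are illustrative defaults (assumptions).
    """
    return [
        "wget",
        "--recursive",                 # follow links in HTML pages
        f"--level={depth}",            # limit how deep the recursion goes
        "--no-parent",                 # never climb above the start directory
        "--convert-links",             # rewrite links for offline viewing
        f"--wait={wait_seconds}",      # pause between requests
        f"--directory-prefix={dest}",  # where the local copy is written
        url,
    ]


class TextExtractor(HTMLParser):
    """Collect text content, skipping the bodies of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def mirror_and_extract(url, dest="mirror"):
    """Download a site with wget, then write a .txt next to each .html file."""
    subprocess.run(build_wget_command(url, dest), check=True)
    for page in Path(dest).rglob("*.html"):
        html = page.read_text(encoding="utf-8", errors="replace")
        page.with_suffix(".txt").write_text(extract_text(html), encoding="utf-8")
```

Note that forum pages are often served from dynamic URLs that do not end in .html, so the file pattern in mirror_and_extract would probably need adjusting for a real forum, and navigation menus and signatures will still end up in the extracted text.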
The big ones (Facebook, Twitter, etc.) have published APIs which allow
you to get content off them, but this requires a bit of programming
skill, normally in Python these days, it seems.
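
To give a feel for the amount of programming involved, here is a minimal sketch of the API route using only the Python standard library. Everything service-specific in it is hypothetical: the endpoint URL, the "q" and "count" parameters, the bearer token and the "items"/"text" layout of the JSON response are placeholders, and each real service documents its own.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint: substitute the one documented by the service.
API_BASE = "https://api.example.com/v1/posts"


def build_request(query, token, count=50):
    """Build an authenticated GET request (parameter names are assumptions)."""
    params = urlencode({"q": query, "count": count})
    return Request(
        f"{API_BASE}?{params}",
        headers={"Authorization": f"Bearer {token}"},
    )


def texts_from_payload(payload):
    """Pull the text fields out of a JSON response with the assumed layout."""
    data = json.loads(payload)
    return [item["text"] for item in data.get("items", []) if item.get("text")]


def fetch_texts(query, token):
    """Send the request and return the post texts as a list of strings."""
    with urlopen(build_request(query, token)) as response:
        return texts_from_payload(response.read().decode("utf-8"))
```

The parsing is kept separate from the network call so that it can be tried out on a saved response before any real requests are sent.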
Cheers Paul
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora