[Corpora-List] RE tool for extracting text from web forum and websites

Paul Johnston paul.johnston at manchester.ac.uk
Wed Oct 14 17:11:58 UTC 2009


wget is the most basic tool I know of, but when you say a web forum, do 
you have a particular one in mind?

GNU Wget                                                  WGET(1)

NAME
    Wget - The non-interactive network downloader.

SYNOPSIS
    wget [option]... [URL]...

DESCRIPTION
    GNU Wget is a free utility for non-interactive download of
    files from the Web.  It supports HTTP, HTTPS, and FTP
    protocols, as well as retrieval through HTTP proxies.

    Wget is non-interactive, meaning that it can work in the
    background, while the user is not logged on.  This allows
    you to start a retrieval and disconnect from the system,
    letting Wget finish the work.  By contrast, most of the Web
    browsers require constant user's presence, which can be a
    great hindrance when transferring a lot of data.

    Wget can follow links in HTML and XHTML pages and create
    local versions of remote web sites, fully recreating the
    directory structure of the original site.  This is sometimes
    referred to as "recursive downloading."  While doing that,
    Wget respects the Robot Exclusion Standard (/robots.txt).
    Wget can be instructed to convert the links in downloaded
    HTML files to the local files for offline viewing.

    Wget has been designed for robustness over slow or unstable
    network connections; if a download fails due to a network
    problem, it will keep retrying until the whole file has been
    retrieved.  If the server supports regetting, it will
    instruct the server to continue the download from where it
    left off.
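
As a rough sketch of the recursive download described in the excerpt above,
the Python helper below only assembles a wget command line and prints it. The
URL (example.org), the output directory name, and the depth and delay values
are placeholders picked for illustration; adjust them to the site you have in
mind and check its terms of use first.

```python
import shlex


def build_wget_command(url, out_dir="forum_mirror", depth=2, wait_seconds=1):
    """Return the argument list for a polite, depth-limited recursive crawl.

    All defaults here are illustrative assumptions, not recommendations
    from the wget documentation.
    """
    return [
        "wget",
        "--recursive",                    # follow links in HTML pages
        f"--level={depth}",               # limit how deep the recursion goes
        "--no-parent",                    # never ascend above the start directory
        "--convert-links",                # rewrite links for offline viewing
        f"--wait={wait_seconds}",         # pause between requests
        f"--directory-prefix={out_dir}",  # save everything under this directory
        url,
    ]


if __name__ == "__main__":
    # example.org is a placeholder, not a real forum.
    cmd = build_wget_command("http://example.org/forum/")
    print(shlex.join(cmd))
    # To actually run the download, one could pass cmd to
    # subprocess.run(cmd, check=True) on a machine with wget installed.
```

Keeping the command as a list rather than a single string avoids shell
quoting problems if the URL contains characters such as "?" or "&".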

The big ones (Facebook, Twitter, etc.) have published APIs which allow 
you to pull content from them, but this requires a bit of programming 
skill, normally in Python these days, it seems.
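
For a forum without an API, the pages that wget saves are still HTML, so the
text has to be pulled out of the markup afterwards. A minimal sketch of that
step, using only Python's standard library, might look like the following.
The names TextExtractor and html_to_text are my own, and a real forum will
almost certainly need site-specific handling (navigation bars, quoted posts,
signatures) that this does not attempt.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping the insides of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if self._skip_depth == 0 and stripped:
            self._chunks.append(stripped)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the text fragments of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<html><body><p>First post</p><script>x = 1</script><p>Reply</p></body></html>"
    print(html_to_text(sample))
```

Each downloaded file could be read and passed through html_to_text in a
loop; for messy real-world pages a dedicated HTML library may cope better
than the standard parser.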

Cheers Paul

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


