[Corpora-List] Extracting only editorial content from a HTML page
Vlado Keselj
vlado at cs.dal.ca
Wed Aug 10 22:11:32 UTC 2005
On Wed, 10 Aug 2005, Vlado Keselj wrote:
>
> This is becoming a *really* long thread, but still I am tempted to add
> my $.02.
>
> I use a Perl script which grabs a web page, does some pre-processing,
> and reports new pieces using the diff command, with some post-processing.
> The algorithm is as follows:
> 1. get the webpage (for this one can use wget, lynx, or some other way)
> 2. pre-process (usually one wants to remove tags, but not necessarily;
>    e.g., lynx -dump, Tidy, or clean_html)
> 3. if there is a previous version of the page then
> 4. |  diff this version against the old one, capturing the new material
> 5. save this page as the old version
> 6. if there was a diff, then the output is only the new material
> 7. post-process
>
> Step 2 may become very interesting. Diff is very good, but it depends
> on physical lines, which are not always broken in an ideal way, so you
> may want to "reshape" them in step 2.
>
> If a page changes dramatically, one gets a burst of noise, but then the
> "extractor" self-stabilizes just wonderfully. I use it as a page-watch:
> I run it as a cron job and mail any diffs.
>
> If anybody is interested I can send/post my Perl script (after some
> clean-up).
>
> --Vlado
Appended. --Vlado
#!/usr/bin/perl
# www.cs.dal.ca/~vlado/srcperl/report-new.pl
# (c) 2000-2005 Vlado Keselj
sub help { print <<"#EOT" }
# Report new material on a web page, version $VERSION
#
# Uses diff, lynx, sendmail (if option -e is used)
#
# Usage: report-new.pl [switches] URL
# -h Print help and exit.
# -v Print version of the program and exit.
# -e email Sends output, if not empty, to email.
#EOT
use strict;
use POSIX qw(strftime);
use vars qw( $VERSION );
$VERSION = sprintf "%d.%d", q$Revision: 1.1 $ =~ /(\d+)/g;
use Getopt::Std;
use vars qw($opt_v $opt_h $opt_e);
getopts("hve:");
if ($opt_v) { print "$VERSION\n"; exit; }
elsif ($opt_h || !@ARGV) { &help(); exit; }
($#ARGV==0 && $ARGV[0]=~/^http:\/\//) ||
die "Format: report-new.pl http://...\n";
my ($urlbase, $url);
$urlbase = $url = shift; # E.g.: http://www.cs.dal.ca/~vlado/srcperl
if ( $url =~ m.//[^/]*/. )
{ $urlbase = $`.$& } # E.g.: http://www.cs.dal.ca/
my $urlId = &encode_w1($url);
my $timestamp = strftime("%Y-%m-%d-%T", localtime(time));
if (! -d 'tmp')
{ mkdir 'tmp', 0700 or die "can't mkdir tmp: $!" }
if (! -d 'report-new.pl.d')
{ mkdir 'report-new.pl.d', 0700 or die "can't mkdir report-new.pl.d: $!" }
my $TmpBase = "tmp/$urlId-$timestamp";
my $TmpFile1 = "$TmpBase-1";
my $lastFile = "report-new.pl.d/$urlId.last";
-e $lastFile or putfile($lastFile,'');
# First step: grab the page
# Escape single quotes so the URL is safe inside the backticked shell
# command (keep $url itself intact for the e-mail subject line below)
(my $shell_url = $url) =~ s/'/'"'"'/g;
my $material = `lynx -dump -nolist '$shell_url'`;
putfile($TmpFile1, $material);
# Second step: diff against the previous version and keep only the
# lines that are new in the current page (marked '< ' by diff)
$material = `diff $TmpFile1 $lastFile 2>&1`;
$material =~ s/^[^<].*\n//mg;   # drop context and non-'<' diff lines
$material =~ s/^< //mg;         # strip the '< ' diff prefix
if ($material) {
if ($opt_e) {
my $out;
open($out, "|sendmail -t") or die;
print $out "To: $opt_e\n".
"Subject: [report-new.pl] $url\n\n$material";
close($out);
}
else { print $material }
}
# Remember the current version for the next run
unlink($lastFile);
rename($TmpFile1, $lastFile);
# Write the given list of strings to the file named by the first argument
sub putfile($@) {
my $f = shift;
local *F;
open(F, ">$f") or die "putfile:cannot open $f:$!";
print F '' unless @_;
while (@_) { print F shift(@_) }
close(F)
}
# Encode a string into a filesystem-safe name: every non-word character
# (and underscore) becomes '_' followed by its two-digit hex code
sub encode_w1 {
local $_ = shift;
s/[\W_]/'_'.uc unpack("H2",$&)/ge;
return $_;
}
## END