[Corpora-List] Extracting only editorial content from a HTML page

Vlado Keselj vlado at cs.dal.ca
Wed Aug 10 13:20:27 UTC 2005


This is becoming a *really* long thread, but still I am tempted to add
my $.02.

I use a Perl script which grabs a web page, does some pre-processing,
reports the new pieces using the diff command, and then does some
post-processing.  The algorithm is as follows:
1. get webpage (for this one can use wget, lynx, or some other way)
2. pre-processing (usually one wants to remove tags, but not necessarily; 
               e.g. lynx -dump, Tidy, or clean_html)
3. if there is a previous version of the page then
4.   | diff this version against the old one, capturing the new stuff
5. save this page as the old version
6. if there was a diff then the webpage is reduced to only the new stuff
7. post-processing
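
For illustration only, here is a minimal Python sketch of steps 2 to 6.
It is not the Perl script mentioned above.  The names preprocess,
added_lines and watch are hypothetical, the tag stripping is a crude
stand-in for lynx -dump or Tidy (it keeps script and style text, for
instance), and Python's difflib is swapped in for the diff command.
Fetching the page (step 1) is left to the caller.

```python
import difflib
import os
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def preprocess(html):
    # Step 2: drop tags, trim each line, skip lines that end up empty.
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    return [line.strip() for line in text.splitlines() if line.strip()]


def added_lines(old, new):
    # Step 4: keep only the lines that are new relative to the old version.
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            out.extend(new[j1:j2])
    return out


def watch(html, state_path):
    # Steps 3 to 6: compare against the saved copy, then overwrite it.
    current = preprocess(html)
    previous = None
    if os.path.exists(state_path):
        with open(state_path, encoding="utf-8") as f:
            previous = f.read().splitlines()
    with open(state_path, "w", encoding="utf-8") as f:
        f.write("\n".join(current) + "\n")
    if previous is None:
        return current  # first run: nothing to diff against yet
    return added_lines(previous, current)
```

On the first call watch returns the whole pre-processed page; on later
calls it returns only the lines that were not in the saved copy, which
is an empty list when nothing changed.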

Step 2 may become very interesting.  Diff is very good, but still it 
depends on physical lines which are not always defined in an ideal way, so 
you may want to "reshape" them in step 2.
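
One possible way to "reshape" lines, sketched in Python under my own
assumptions: join the text and re-split it so that each line holds one
sentence, which makes the diff independent of where the page happened
to wrap.  The function name reshape and the naive sentence rule (split
after ., ! or ?) are hypothetical; abbreviations and the like will
fool it.

```python
import re


def reshape(text):
    # Collapse every run of whitespace (including newlines) to one space.
    flat = " ".join(text.split())
    if not flat:
        return []
    # Split after sentence-ending punctuation: one sentence per line.
    return re.split(r"(?<=[.!?])\s+", flat)
```

With this, the same paragraph wrapped at different widths yields the
same list of lines, so re-wrapping alone no longer shows up as a
change.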

If a page changes dramatically, one gets a burst of noise, but the 
"extractor" self-stabilizes just wonderfully, since the changed page 
becomes the new "old" version in step 5.  I use it as a page-watch: I run 
it as a cron job and mail any diffs.

If anybody is interested I can send/post my Perl script (after some 
clean-up).

--Vlado


