[Corpora-List] Extracting only editorial content from a HTML page

Paul Clough p.d.clough at sheffield.ac.uk
Wed Aug 10 08:55:55 UTC 2005


Hi all,

Another useful reference is the VIPS (Vision-based Page Segmentation) work
from Microsoft:

http://research.microsoft.com/research/pubs/view.aspx?tr_id=690

They segment pages based on visual layout and seem to get good results.
In my own work, I used lynx on UNIX with the -dump option, which seemed to
work okay (quick and dirty, though):

lynx -dump file.html > file.txt
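
For more than a handful of files, the same command could be wrapped in a
small script. A minimal sketch in Python, assuming lynx is installed and on
the PATH; the function names and the optional -nolist flag (which drops the
list of link URLs that lynx appends to a dump) are illustrative additions,
not part of the one-liner above:

```python
import subprocess
from pathlib import Path


def lynx_dump_command(html_path, nolist=False):
    """Build the argument list for `lynx -dump <file>` (hypothetical helper)."""
    cmd = ["lynx", "-dump"]
    if nolist:
        # Assumption: leave out the numbered list of link URLs in the dump.
        cmd.append("-nolist")
    cmd.append(str(html_path))
    return cmd


def html_to_text(html_path, txt_path=None, nolist=False):
    """Render one HTML file to text, like `lynx -dump file.html > file.txt`."""
    html_path = Path(html_path)
    # Default output name: same path with a .txt extension.
    txt_path = Path(txt_path) if txt_path else html_path.with_suffix(".txt")
    with open(txt_path, "wb") as out:
        # check=True raises CalledProcessError if lynx exits with an error.
        subprocess.run(lynx_dump_command(html_path, nolist), stdout=out, check=True)
    return txt_path
```

Calling html_to_text("file.html") would then write file.txt next to the
input. Note that this still dumps the whole rendered page, navigation and
all, so it remains the quick and dirty option.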

Cheers,

Paul.


-------------------------------------------
Dr. Paul Clough     
Dept. Information Studies
University of Sheffield

+44 (0)114 2222664
-------------------------------------------


