[Corpora-List] Extracting only editorial content from a HTML page
Paul Clough
p.d.clough at sheffield.ac.uk
Wed Aug 10 08:55:55 UTC 2005
Hi all,
Another useful reference is the VIPS work from Microsoft:
http://research.microsoft.com/research/pubs/view.aspx?tr_id=690
They segment pages based on visual layout and seem to get good
results. In my own work, I used lynx on UNIX with the -dump option, which
seemed to work okay (quick and dirty though):
lynx -dump file.html > file.txt
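For more than one file, the same command can be wrapped in a small loop. This is only a rough sketch: dump_dir is a made-up helper name (not part of lynx), and it assumes the pages are saved locally with a .html extension in a single directory.

```shell
# Convert every .html file in a directory to plain text with lynx -dump.
# dump_dir is a hypothetical helper name, not something lynx provides.
dump_dir() {
    dir="${1:-.}"                      # default to the current directory
    for f in "$dir"/*.html; do
        # Skip the unexpanded pattern when no .html files exist.
        [ -e "$f" ] || continue
        # Write the text next to the source: page.html -> page.txt
        lynx -dump "$f" > "${f%.html}.txt"
    done
}
```

Calling dump_dir ./pages would then leave a .txt file beside each .html file. The output still contains navigation text and link lists, so it remains a quick-and-dirty first pass rather than true editorial-content extraction.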
Cheers,
Paul.
-------------------------------------------
Dr. Paul Clough
Dept. Information Studies
University of Sheffield
+44 (0)114 2222664
-------------------------------------------