[Corpora-List] Extracting only editorial content from a HTML page

Tom Emerson tree at basistech.com
Tue Aug 9 20:23:57 UTC 2005

Lou Burnard writes:
> The other tool for this purpose which no-one has (so far) mentioned is 
> tidy -- http://tidy.sourceforge.net
> It will take almost any html and turn it into something usable very 
> fast; it's also very robust and there is a choice of APIs for 
> integrating it into your own production system

Just a warning to folks: while Tidy is good, it can get very confused
on bogus HTML, and will crash horribly in ways that are non-trivial to
debug. I've found that pages which have bogus JavaScript embedded can
cause lots of problems, as well as pages in stranger character
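
A minimal sketch of one way to contain such crashes, assuming the `tidy` command-line binary is installed and on the PATH: run it in a child process with a timeout, so a crash or hang on a bad page costs only that page and not the whole harvesting job. The function name `tidy_isolated`, the default flags and the 10-second timeout are illustrative choices, not anything prescribed by Tidy.

```python
import subprocess
import sys


def tidy_isolated(html, cmd=("tidy", "-q", "-asxhtml"), timeout=10.0):
    """Feed `html` (bytes) to `cmd` on stdin, in a separate process.

    Returns the cleaned bytes, or None if the child is missing,
    hangs past `timeout`, is killed by a signal, or reports errors.
    """
    try:
        proc = subprocess.run(
            list(cmd),
            input=html,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        print("tidy: timed out", file=sys.stderr)
        return None
    except OSError as exc:  # e.g. binary not found
        print(f"tidy: could not start: {exc}", file=sys.stderr)
        return None

    # Tidy exits with 0 (clean) or 1 (warnings) when it produced output,
    # and 2 on errors. On POSIX a negative code means death by signal,
    # which is how a segfault in the child shows up here.
    if proc.returncode not in (0, 1):
        print(f"tidy: failed with code {proc.returncode}", file=sys.stderr)
        return None
    return proc.stdout
```

Pages for which this returns None can be set aside and inspected later, which makes it easier to see whether embedded JavaScript or the character encoding was the trigger.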


Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
 "You can't fake quality any more than you can fake a good meal." (W.S.B.)
