[Corpora-List] Extracting only editorial content from a HTML page

Tom Emerson tree at basistech.com
Tue Aug 9 20:23:57 UTC 2005


Lou Burnard writes:
> The other tool for this purpose which no-one has (so far) mentioned is 
> tidy -- http://tidy.sourceforge.net
> 
> It will take almost any html and turn it into something usable very 
> fast; it's also very robust and there is a choice of APIs for 
> integrating it into your own production system

Just a warning to folks: while Tidy is good, it can get very confused
on bogus HTML, and will crash horribly in ways that are non-trivial to
debug. I've found that pages which have bogus JavaScript embedded can
cause lots of problems, as well as pages in stranger character
encodings.
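
One way to limit the damage is to run Tidy as a separate process with a
timeout, so that a crash or hang on one bad page only costs you that page.
What follows is a minimal sketch in Python, not a tested production setup.
The names run_isolated, tidy_or_skip and TIDY_CMD are made up for
illustration. It assumes a "tidy" executable is on the PATH, that the
"-q" and "-asxhtml" options suit your data, and that exit status 0 or 1
(clean, or warnings only) means the output is usable.

```python
import subprocess
import sys

# Assumption: "tidy" is on the PATH and these options fit your pages.
TIDY_CMD = ["tidy", "-q", "-asxhtml"]


def run_isolated(cmd, data, timeout=10.0, ok_codes=(0, 1)):
    """Run cmd as a child process with data on stdin.

    Returns the child's stdout as bytes, or None if the child could not be
    started, timed out, or exited with a status outside ok_codes (which
    includes being killed by a signal, reported as a negative status on
    POSIX systems).
    """
    try:
        proc = subprocess.run(
            cmd,
            input=data,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Hung child (killed by subprocess.run) or missing executable.
        return None
    if proc.returncode not in ok_codes:
        return None
    return proc.stdout


def tidy_or_skip(html_bytes, timeout=10.0):
    """Clean one page with Tidy; return None and log if Tidy fails."""
    cleaned = run_isolated(TIDY_CMD, html_bytes, timeout=timeout)
    if cleaned is None:
        print("tidy failed on this page; skipping", file=sys.stderr)
    return cleaned
```

A crawler loop would call tidy_or_skip(page_bytes) for each page and set
aside any page that comes back as None for later inspection. That also
gives you a pile of failing inputs to debug with, rather than a core dump
in the middle of a long run.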

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
 "You can't fake quality any more than you can fake a good meal." (W.S.B.)


