[Corpora-List] Extracting only editorial content from a HTML page
Tom Emerson
tree at basistech.com
Tue Aug 9 20:23:57 UTC 2005
Lou Burnard writes:
> The other tool for this purpose which no-one has (so far) mentioned is
> tidy -- http://tidy.sourceforge.net
>
> It will take almost any html and turn it into something usable very
> fast; it's also very robust and there is a choice of APIs for
> integrating it into your own production system
Just a warning to folks: while Tidy is good, it can get very confused
by malformed HTML, and it will crash horribly in ways that are
non-trivial to debug. I've found that pages with bogus embedded
JavaScript can cause lots of problems, as can pages in less common
character encodings.
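
One way to keep a crash like this from taking down a whole crawl is to
run Tidy as a separate process with a timeout, and to skip any page on
which it dies. The following is only a minimal sketch under some
assumptions: it uses Python and the `tidy` command-line tool on the
PATH rather than one of the library APIs, the helper name
`run_isolated` is made up for illustration, and the `-utf8` flag assumes
the input has already been converted to UTF-8 (which sidesteps, rather
than solves, the character-encoding problem).

```python
import subprocess
import sys

# Assumed invocation: quiet mode, XHTML output, UTF-8 in and out, and
# emit output even when Tidy reports errors.
TIDY_CMD = ["tidy", "-q", "-asxhtml", "-utf8", "--force-output", "yes"]


def run_isolated(cmd, data, timeout=10.0, ok_codes=(0, 1, 2)):
    """Feed `data` (bytes) to `cmd` on stdin in a child process.

    Returns the child's stdout as bytes, or None if the child could not
    be started, timed out, was killed by a signal, or exited with a
    status outside `ok_codes`. Tidy documents 0 (ok), 1 (warnings) and
    2 (errors) as its normal exit statuses.
    """
    try:
        proc = subprocess.run(
            cmd,
            input=data,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing executable, or a hang on pathological input.
        return None
    # On POSIX a negative return code means death by signal (e.g. a
    # segfault); anything else outside ok_codes is treated as a failure.
    if proc.returncode not in ok_codes:
        return None
    return proc.stdout


if __name__ == "__main__":
    # Usage: python tidy_isolated.py page.html > page.xhtml
    with open(sys.argv[1], "rb") as fh:
        cleaned = run_isolated(TIDY_CMD, fh.read())
    if cleaned is None:
        sys.stderr.write("tidy failed on %s; skipping\n" % sys.argv[1])
    else:
        sys.stdout.buffer.write(cleaned)
```

The cost is one process launch per page, but a page that makes Tidy
crash or hang is then logged and skipped instead of stopping the run.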
-tree
--
Tom Emerson Basis Technology Corp.
Software Architect http://www.basistech.com
"You can't fake quality any more than you can fake a good meal." (W.S.B.)