[Corpora-List] Extracting only editorial content from a HTML page

Tue Aug 9 20:59:01 UTC 2005

Take this warning seriously.  Tidy is designed as an HTML checker, and
its ability to output content sans JSP and tags is a side benefit.
Because it is an HTML checker, it will refuse to output the content if
the HTML is bad (or if Tidy hasn't caught up to some new HTML idiom).

We used to use Tidy for this purpose and no longer do.
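
For anyone who wants output even when the markup is bad, here is a
minimal sketch of a more forgiving approach. It assumes Python and its
standard-library html.parser module, neither of which comes up in this
thread; the names TextExtractor and extract_text are made up for
illustration. It only strips tags and script/style bodies, and does
nothing about JSP or about picking out editorial content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, ignoring anything inside <script> or <style>.

    Hypothetical helper for illustration; HTMLParser is lenient and does
    not reject unclosed or misnested tags.
    """

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the text of an HTML string, tolerating malformed markup."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.text()


if __name__ == "__main__":
    print(extract_text("<div><p>unclosed <i>tags</div>"))
```

The trade-off is the opposite of Tidy's: nothing is validated, so bad
input yields some text rather than a refusal, and it is up to you to
decide whether that text is usable.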

Max Copperman
Knova Software, Inc.

-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
Behalf Of Tom Emerson
Sent: Tuesday, August 09, 2005 1:24 PM
To: Lou Burnard
Cc: Rob Malouf; Helge Thomas Hellerud; corpora at uib.no
Subject: Re: [Corpora-List] Extracting only editorial content from a
HTML page

Lou Burnard writes:
> The other tool for this purpose which no-one has (so far) mentioned is
> tidy -- http://tidy.sourceforge.net
> 
> It will take almost any html and turn it into something usable very 
> fast; it's also very robust and there is a choice of APIs for 
> integrating it into your own production system

Just a warning to folks: while Tidy is good, it can get very confused
on bogus HTML, and will crash horribly in ways that are non-trivial to
debug. I've found that pages which have bogus JavaScript embedded can
cause lots of problems, as well as pages in stranger character
encodings.

    -tree

-- 
Tom Emerson                                    Basis Technology Corp.
Software Architect
http://www.basistech.com
 "You can't fake quality any more than you can fake a good meal." (W.S.B.)